CS4132 Data Analytics
Tourism has long been a major form of leisure, a way for people to relax and enjoy themselves. It is an important source of income and employment for both developed and developing countries. Unfortunately, COVID-19 disrupted this sector immensely and the tourism industry plummeted.
Tourism can be regarded as a social, cultural and economic phenomenon related to the movement of people outside their usual place of residence. It takes several forms: domestic tourism comprises the activities of a resident visitor within the country of reference; inbound tourism comprises the activities of a non-resident visitor within the country of reference; and outbound tourism comprises the activities of a resident visitor outside the country of reference.
In this project, I will analyse the numbers of departures and arrivals to gauge each country's popularity in outbound, inbound, and domestic tourism, which in turn reflects its reputation among travellers from other countries. I will only analyse data from 1995 to 2019, before the COVID-19 pandemic. I will also look for correlations between this popularity and other variables.
First of all, I include all the necessary imports here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import folium
import pycountry
import plotly.express as px #pip install plotly==5.10.0 OR conda install -c plotly plotly=5.10.0
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
Next, I read in the data files.
#1
data = pd.read_excel('unwto-all-data-download_0.xlsx', sheet_name=None, header = None)
#2 PDF loaded in Appendix
ranking_19 = pd.read_csv('overall_rankings_2019.csv')
#3 JSON loaded in Appendix
region = pd.read_csv('region.csv')
#4 Retrieved in Appendix
countries_interest = pd.read_csv('countries_interest.csv')
#5 Web scraped in Appendix
countries = pd.read_csv('countries.csv', index_col=0)
#6
area = pd.read_csv('API_AG.LND.TOTL.K2_DS2_en_csv_v2_4546125.csv', names = range(67))
I will be analysing 3 main datasets: outbound departures, inbound arrivals, and domestic trips.
I will first roughly clean the datasets so that they include only the data I may need. After that, I will further clean the data by selecting and joining columns to match what I want to analyse.
Firstly, I will be roughly cleaning outbound departures.
#Getting dataframe for outbound departures
outbound_departures = data['Outbound Tourism-Departures']
#Dropping redundant data
outbound_departures = outbound_departures.drop([0,1])
outbound_departures = outbound_departures.drop(outbound_departures.tail(4).index)
outbound_departures = outbound_departures.iloc[: , :-1]
outbound_departures = outbound_departures.drop([0,1,2,4,7,9], axis = 1)
outbound_departures = outbound_departures.reset_index(drop=True)
outbound_departures.columns = outbound_departures.iloc[0]
outbound_departures = outbound_departures[1:]
outbound_departures = outbound_departures[outbound_departures.iloc[:,4]!='Departures']
#Forward filling the country names (assignment avoids the chained inplace fill)
outbound_departures.iloc[:,0] = outbound_departures.iloc[:,0].ffill()
outbound_departures = outbound_departures[outbound_departures.Units.notna()]
#Replacing '..' with NaN
outbound_departures = outbound_departures.replace('..', np.nan)
outbound_departures.iloc[:,1:3] = outbound_departures.iloc[:,1:3].ffill(axis=1)
#Renaming the columns
new_columns = list(outbound_departures.columns.values)
new_columns[0] = 'Countries'
new_columns[1] = 'placeholder'
new_columns[2] = 'Indicators'
outbound_departures.columns = new_columns
outbound_departures = outbound_departures.drop('placeholder', axis = 1)
outbound_departures = outbound_departures.reset_index(drop=True)
#Setting the index
outbound_departures.index = [np.array(outbound_departures['Countries']), np.array(outbound_departures['Indicators'])]
outbound_departures = outbound_departures.drop(['Countries', 'Indicators'], axis = 1)
#Renaming the index
outbound_departures.index.names = ['Countries', 'Indicators']
outbound_departures
| Units | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Countries | Indicators | |||||||||||||||||||||
| AFGHANISTAN | Total departures | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ALBANIA | Total departures | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | 955.0 | 1303.0 | 1350.0 | ... | 4120.0 | 3959.0 | 3928.0 | 4146.0 | 4504.0 | 4852.0 | 5186.0 | 5415.0 | 5922.0 | 2907.0 |
| Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZAMBIA | Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ZIMBABWE | Total departures | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Overnights visitors (tourists) | Thousands | 256.0 | 69.0 | 123.0 | 213.0 | 331.0 | NaN | NaN | NaN | 386.0 | ... | 693.0 | 720.0 | 2946.0 | 3182.0 | 3393.0 | 3192.0 | 2768.0 | 2288.0 | 3275.0 | NaN | |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
669 rows × 27 columns
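As a side note, the two parallel `np.array` assignments used to build the (Countries, Indicators) index above can also be written with `pd.MultiIndex.from_frame`, which keeps the level names in one step. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame shaped like the cleaned tourism data
df = pd.DataFrame({
    'Countries': ['ALBANIA', 'ALBANIA'],
    'Indicators': ['Total departures', 'Overnights visitors (tourists)'],
    1995: [1.2, 0.9],
})
# Build the MultiIndex directly from the two columns, names included
df.index = pd.MultiIndex.from_frame(df[['Countries', 'Indicators']])
df = df.drop(columns=['Countries', 'Indicators'])
```

This removes the need to rename the index levels afterwards, since the column names carry over as level names.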
Secondly, I will be roughly cleaning the inbound arrivals.
#Getting dataframe for inbound arrivals
inbound_arrivals = data[' Inbound Tourism-Arrivals']
#Dropping redundant data
inbound_arrivals = inbound_arrivals.drop([0,1])
inbound_arrivals = inbound_arrivals.drop(inbound_arrivals.tail(8).index)
inbound_arrivals = inbound_arrivals.iloc[: , :-1]
inbound_arrivals = inbound_arrivals.drop([0,1,2,4,9,10], axis = 1)
inbound_arrivals = inbound_arrivals.reset_index(drop=True)
inbound_arrivals.columns = inbound_arrivals.iloc[0]
inbound_arrivals = inbound_arrivals[1:]
inbound_arrivals = inbound_arrivals[inbound_arrivals.iloc[:,4]!='Arrivals']
#Forward filling the country names (assignment avoids the chained inplace fill)
inbound_arrivals.iloc[:,0] = inbound_arrivals.iloc[:,0].ffill()
inbound_arrivals = inbound_arrivals[inbound_arrivals.Units.notna()]
#Replacing '..' with NaN
inbound_arrivals = inbound_arrivals.replace('..', np.nan)
inbound_arrivals.iloc[:,1:4] = inbound_arrivals.iloc[:,1:4].ffill(axis=1)
#Renaming the columns
new_columns = list(inbound_arrivals.columns.values)
new_columns[0] = 'Countries'
new_columns[1] = 'placeholder'
new_columns[2] = 'placeholder2'
new_columns[3] = 'Indicators'
inbound_arrivals.columns = new_columns
inbound_arrivals = inbound_arrivals.drop(['placeholder', 'placeholder2'], axis = 1)
inbound_arrivals = inbound_arrivals.reset_index(drop=True)
#Setting the index
inbound_arrivals.index = [np.array(inbound_arrivals['Countries']), np.array(inbound_arrivals['Indicators'])]
#Removing Indicators that are 'of which, cruise passengers' as it is a subset of 'Same-day visitors (excursionists)'
inbound_arrivals = inbound_arrivals[inbound_arrivals.Indicators != 'of which, cruise passengers']
inbound_arrivals = inbound_arrivals.drop(['Countries', 'Indicators'], axis = 1)
#Renaming the index
inbound_arrivals.index.names = ['Countries', 'Indicators']
inbound_arrivals
| Units | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Countries | Indicators | |||||||||||||||||||||
| AFGHANISTAN | Total arrivals | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ALBANIA | Total arrivals | Thousands | 304.0 | 287.0 | 119.0 | 184.0 | 371.0 | 317.0 | 354.0 | 470.0 | 557.0 | ... | 2932.0 | 3514.0 | 3256.0 | 3673.0 | 4131.0 | 4736.0 | 5118.0 | 5927.0 | 6406.0 | 2658.0 |
| Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2469.0 | 3156.0 | 2857.0 | 3341.0 | 3784.0 | 4070.0 | 4643.0 | 5340.0 | 6128.0 | 2604.0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZAMBIA | Overnights visitors (tourists) | Thousands | 163.0 | 264.0 | 341.0 | 362.0 | 404.0 | 457.0 | 492.0 | 565.0 | 413.0 | ... | 920.0 | 859.0 | 915.0 | 947.0 | 932.0 | 956.0 | 1009.0 | 1072.0 | 1266.0 | 502.0 |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ZIMBABWE | Total arrivals | Thousands | 1416.0 | 1597.0 | 1336.0 | 2090.0 | 2250.0 | 1967.0 | 2217.0 | 2041.0 | 2256.0 | ... | 2423.0 | 1794.0 | 1833.0 | 1880.0 | 2057.0 | 2168.0 | 2423.0 | 2580.0 | 2294.0 | 639.0 |
| Overnights visitors (tourists) | Thousands | 1363.0 | 1577.0 | 1281.0 | 1986.0 | 2101.0 | 1868.0 | 2068.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| Same-day visitors (excursionists) | Thousands | 53.0 | 20.0 | 55.0 | 104.0 | 149.0 | 99.0 | 149.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
669 rows × 27 columns
Similar to outbound departures and inbound arrivals, I roughly clean up domestic trips.
#Getting dataframe for domestic trips
domestic_trips = data['Domestic Tourism-Trips']
#Dropping redundant data
domestic_trips = domestic_trips.drop([0,1])
domestic_trips = domestic_trips.drop(domestic_trips.tail(4).index)
domestic_trips = domestic_trips.iloc[: , :-1]
domestic_trips = domestic_trips.drop([0,1,2,4,9], axis = 1)
domestic_trips = domestic_trips.reset_index(drop=True)
domestic_trips.columns = domestic_trips.iloc[0]
domestic_trips = domestic_trips[1:]
domestic_trips = domestic_trips[domestic_trips.iloc[:,4]!='Trips']
#Forward filling the country names (assignment avoids the chained inplace fill)
domestic_trips.iloc[:,0] = domestic_trips.iloc[:,0].ffill()
domestic_trips = domestic_trips[domestic_trips.Units.notna()]
#Replacing '..' with NaN
domestic_trips = domestic_trips.replace('..', np.nan)
domestic_trips.iloc[:,1:4] = domestic_trips.iloc[:,1:4].ffill(axis=1)
#Renaming the columns
new_columns = list(domestic_trips.columns.values)
new_columns[0] = 'Countries'
new_columns[1] = 'placeholder'
new_columns[2] = 'placeholder2'
new_columns[3] = 'Indicators'
domestic_trips.columns = new_columns
domestic_trips = domestic_trips.drop(['placeholder', 'placeholder2'], axis = 1)
domestic_trips = domestic_trips.reset_index(drop=True)
#Setting the index
domestic_trips.index = [np.array(domestic_trips['Countries']), np.array(domestic_trips['Indicators'])]
domestic_trips = domestic_trips.drop(['Countries', 'Indicators'], axis = 1)
#Renaming the index
domestic_trips.index.names = ['Countries', 'Indicators']
domestic_trips
| Units | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Countries | Indicators | |||||||||||||||||||||
| AFGHANISTAN | Total trips | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ALBANIA | Total trips | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZAMBIA | Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| ZIMBABWE | Total trips | Thousands | NaN | NaN | NaN | NaN | 35233.0 | 32468.0 | 29178.0 | 22109.0 | 19894.0 | ... | 15878.0 | 16327.0 | 13431.0 | 13781.0 | 16427.0 | 16377.0 | 15707.0 | 15180.0 | 20991.0 | NaN |
| Overnights visitors (tourists) | Thousands | NaN | NaN | NaN | NaN | 20427.0 | 18824.0 | 16917.0 | 12818.0 | 11534.0 | ... | 9206.0 | 9466.0 | 7787.0 | 7990.0 | 9524.0 | 9495.0 | 9106.0 | 8801.0 | 12157.0 | NaN | |
| Same-day visitors (excursionists) | Thousands | NaN | NaN | NaN | NaN | 14806.0 | 13644.0 | 12261.0 | 9291.0 | 8360.0 | ... | 6672.0 | 6861.0 | 5644.0 | 5791.0 | 6903.0 | 6882.0 | 6600.0 | 6379.0 | 8834.0 | NaN |
669 rows × 27 columns
Here I have the 2019 rankings of countries in different areas, along with an overall ranking. I use 2019 as the latest ranking year because I am analysing tourism before the pandemic.
I am only dropping the irrelevant column.
#Dropping redundant data
ranking_19 = ranking_19.drop('Unnamed: 0', axis=1)
ranking_19
| Overall Rank | Country | Entrepreneurship | Adventure | Citizenship | Cultural Influence | Heritage | Movers | Open for Business | Power | Quality of Life | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Switzerland | 4 | 17 | 3 | 9 | 27 | 25 | 2 | 14 | 5 |
| 1 | 2 | Japan | 1 | 39 | 17 | 6 | 10 | 5 | 22 | 7 | 13 |
| 2 | 3 | Canada | 6 | 19 | 2 | 12 | 42 | 39 | 7 | 12 | 1 |
| 3 | 4 | Germany | 2 | 57 | 12 | 11 | 20 | 34 | 21 | 4 | 10 |
| 4 | 5 | United Kingdom | 5 | 40 | 11 | 5 | 12 | 53 | 23 | 5 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70 | 71 | Jordan | 62 | 71 | 74 | 70 | 48 | 50 | 63 | 33 | 77 |
| 71 | 72 | Tunisia | 69 | 60 | 76 | 65 | 53 | 63 | 55 | 63 | 68 |
| 72 | 73 | Belarus | 56 | 61 | 50 | 71 | 67 | 66 | 76 | 35 | 67 |
| 73 | 74 | Nigeria | 67 | 74 | 77 | 63 | 76 | 57 | 58 | 46 | 74 |
| 74 | 75 | Pakistan | 68 | 77 | 78 | 79 | 71 | 56 | 72 | 22 | 73 |
75 rows × 11 columns
Similarly, I am also dropping the irrelevant column for region. Region will be used to group data together based on location.
#Dropping redundant data
region = region.drop('Unnamed: 0', axis=1)
region
| name | alpha-2 | alpha-3 | country-code | iso_3166-2 | region | sub-region | intermediate-region | region-code | sub-region-code | intermediate-region-code | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 4 | ISO 3166-2:AF | Asia | Southern Asia | NaN | 142.0 | 34.0 | NaN |
| 1 | Åland Islands | AX | ALA | 248 | ISO 3166-2:AX | Europe | Northern Europe | NaN | 150.0 | 154.0 | NaN |
| 2 | Albania | AL | ALB | 8 | ISO 3166-2:AL | Europe | Southern Europe | NaN | 150.0 | 39.0 | NaN |
| 3 | Algeria | DZ | DZA | 12 | ISO 3166-2:DZ | Africa | Northern Africa | NaN | 2.0 | 15.0 | NaN |
| 4 | American Samoa | AS | ASM | 16 | ISO 3166-2:AS | Oceania | Polynesia | NaN | 9.0 | 61.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 244 | Wallis and Futuna | WF | WLF | 876 | ISO 3166-2:WF | Oceania | Polynesia | NaN | 9.0 | 61.0 | NaN |
| 245 | Western Sahara | EH | ESH | 732 | ISO 3166-2:EH | Africa | Northern Africa | NaN | 2.0 | 15.0 | NaN |
| 246 | Yemen | YE | YEM | 887 | ISO 3166-2:YE | Asia | Western Asia | NaN | 142.0 | 145.0 | NaN |
| 247 | Zambia | ZM | ZMB | 894 | ISO 3166-2:ZM | Africa | Sub-Saharan Africa | Eastern Africa | 2.0 | 202.0 | 14.0 |
| 248 | Zimbabwe | ZW | ZWE | 716 | ISO 3166-2:ZW | Africa | Sub-Saharan Africa | Eastern Africa | 2.0 | 202.0 | 14.0 |
249 rows × 11 columns
Here we have the countries and their respective latitudes and longitudes. I renamed some countries to match the geodata. I later found better geodata that makes the renaming unnecessary, but I kept the step anyway.
countries = countries.reset_index(drop = True)
#Renaming countries' names to match geodata
countries = countries.replace({'name': {'United States': 'United States of America', 'Bahamas': 'The Bahamas', 'Serbia': 'Republic of Serbia', 'Macedonia [FYROM]': 'Macedonia', 'Myanmar [Burma]': 'Myanmar', 'Guinea-Bissau': 'Guinea Bissau', 'Congo [Republic]': 'Republic of the Congo', 'Tanzania': 'United Republic of Tanzania', 'Timor-Leste': 'East Timor'}})
countries
| country | latitude | longitude | name | |
|---|---|---|---|---|
| 0 | AD | 42.546245 | 1.601554 | Andorra |
| 1 | AE | 23.424076 | 53.847818 | United Arab Emirates |
| 2 | AF | 33.939110 | 67.709953 | Afghanistan |
| 3 | AG | 17.060816 | -61.796428 | Antigua and Barbuda |
| 4 | AI | 18.220554 | -63.068615 | Anguilla |
| ... | ... | ... | ... | ... |
| 240 | YE | 15.552727 | 48.516388 | Yemen |
| 241 | YT | -12.827500 | 45.166244 | Mayotte |
| 242 | ZA | -30.559482 | 22.937506 | South Africa |
| 243 | ZM | -13.133897 | 27.849332 | Zambia |
| 244 | ZW | -19.015438 | 29.154857 | Zimbabwe |
245 rows × 4 columns
I cleaned the data from Google Trends by making Date the index and also changing its dtype to datetime.
#Renaming 'date' into 'Date'
countries_interest = countries_interest.rename(columns={'date':'Date'})
#Setting Date as the index
countries_interest.set_index('Date', inplace=True)
#Changing the data type of date into datetime
countries_interest.index = pd.to_datetime(countries_interest.index)
countries_interest
| Afghanistan | Åland Islands | Albania | American Samoa | Andorra | Angola | Anguilla | Antarctica | Antigua and Barbuda | Argentina | ... | Turks and Caicos Islands | Tuvalu | Uganda | Ukraine | United Arab Emirates | United Kingdom of Great Britain and Northern Ireland | United States of America | United States Minor Outlying Islands | Viet Nam | Zimbabwe | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 2004-01-01 | 7 | 80 | 17 | 34 | 95 | 23 | 100 | 74 | 0 | 30 | ... | 43 | 60 | 41 | 2 | 69 | 0 | 48 | 0 | 9 | 29 |
| 2004-02-01 | 7 | 0 | 19 | 29 | 83 | 27 | 68 | 86 | 44 | 30 | ... | 100 | 100 | 42 | 2 | 73 | 0 | 53 | 0 | 11 | 28 |
| 2004-03-01 | 8 | 0 | 18 | 25 | 72 | 25 | 70 | 100 | 32 | 34 | ... | 78 | 63 | 40 | 2 | 77 | 0 | 42 | 100 | 12 | 29 |
| 2004-04-01 | 8 | 0 | 20 | 28 | 49 | 27 | 61 | 81 | 24 | 34 | ... | 28 | 58 | 43 | 2 | 75 | 0 | 45 | 80 | 13 | 31 |
| 2004-05-01 | 8 | 100 | 20 | 27 | 49 | 28 | 68 | 100 | 26 | 34 | ... | 5 | 54 | 43 | 2 | 71 | 0 | 45 | 0 | 11 | 30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-03-01 | 9 | 24 | 37 | 16 | 34 | 30 | 54 | 62 | 37 | 43 | ... | 18 | 39 | 91 | 89 | 34 | 55 | 15 | 11 | 20 | 26 |
| 2022-04-01 | 6 | 35 | 30 | 19 | 32 | 31 | 47 | 54 | 34 | 37 | ... | 18 | 42 | 92 | 33 | 28 | 60 | 16 | 15 | 17 | 25 |
| 2022-05-01 | 6 | 37 | 39 | 18 | 29 | 33 | 40 | 52 | 33 | 37 | ... | 15 | 39 | 89 | 22 | 28 | 61 | 16 | 13 | 20 | 29 |
| 2022-06-01 | 10 | 24 | 43 | 15 | 37 | 32 | 44 | 49 | 38 | 58 | ... | 18 | 33 | 98 | 16 | 30 | 41 | 13 | 18 | 20 | 39 |
| 2022-07-01 | 5 | 25 | 45 | 17 | 37 | 35 | 44 | 49 | 32 | 39 | ... | 17 | 38 | 100 | 13 | 27 | 38 | 12 | 16 | 19 | 46 |
223 rows × 227 columns
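One payoff of converting the index to datetime is partial-string slicing: a `DatetimeIndex` lets me select, say, only the pre-pandemic months by year label. A small sketch on toy data:

```python
import pandas as pd

interest = pd.Series(
    [7, 8, 9, 5],
    index=pd.to_datetime(['2004-01-01', '2010-06-01', '2019-12-01', '2022-07-01']),
)
# With a DatetimeIndex, year strings work as inclusive slice labels
pre_covid = interest.loc['2004':'2019']
```

Here `pre_covid` keeps the first three observations and drops the 2022 one, without any manual date arithmetic.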
I will now clean the data on the land area of all countries. I first remove all unnecessary rows and columns, followed by changing the object data to floats.
#Dropping redundant data
area = area.loc[2:, :63]
area = area.T.set_index(2).T.reset_index(drop = True)
area.columns.name = ''
area = area.drop([1960.0, 'Indicator Name', 'Indicator Code'], axis = 1)
#Changing the data type to float
area[np.arange(1961,2020)] = area[np.arange(1961,2020)].astype('float')
area
| Country Name | Country Code | 1961.0 | 1962.0 | 1963.0 | 1964.0 | 1965.0 | 1966.0 | 1967.0 | 1968.0 | ... | 2010.0 | 2011.0 | 2012.0 | 2013.0 | 2014.0 | 2015.0 | 2016.0 | 2017.0 | 2018.0 | 2019.0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aruba | ABW | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | ... | 1.800000e+02 | 180.00 | 1.800000e+02 | 1.800000e+02 | 1.800000e+02 | 1.800000e+02 | 1.800000e+02 | 1.800000e+02 | 1.800000e+02 | 1.800000e+02 |
| 1 | Africa Eastern and Southern | AFE | 14571611.0 | 14571611.0 | 14571611.0 | 14571611.0 | 14571611.0 | 14571611.0 | 14571611.0 | 14571611.0 | ... | 1.472096e+07 | 14721240.05 | 1.484517e+07 | 1.484513e+07 | 1.484509e+07 | 1.484514e+07 | 1.484515e+07 | 1.484514e+07 | 1.484515e+07 | 1.484516e+07 |
| 2 | Afghanistan | AFG | 652230.0 | 652230.0 | 652230.0 | 652230.0 | 652230.0 | 652230.0 | 652230.0 | 652230.0 | ... | 6.522300e+05 | 652230.00 | 6.522300e+05 | 6.522300e+05 | 6.522300e+05 | 6.522300e+05 | 6.522300e+05 | 6.522300e+05 | 6.522300e+05 | 6.522300e+05 |
| 3 | Africa Western and Central | AFW | 9046580.0 | 9046580.0 | 9046580.0 | 9046580.0 | 9046580.0 | 9046580.0 | 9046580.0 | 9046580.0 | ... | 9.045780e+06 | 9045780.00 | 9.045780e+06 | 9.045780e+06 | 9.045780e+06 | 9.045780e+06 | 9.045780e+06 | 9.045780e+06 | 9.045780e+06 | 9.045780e+06 |
| 4 | Angola | AGO | 1246700.0 | 1246700.0 | 1246700.0 | 1246700.0 | 1246700.0 | 1246700.0 | 1246700.0 | 1246700.0 | ... | 1.246700e+06 | 1246700.00 | 1.246700e+06 | 1.246700e+06 | 1.246700e+06 | 1.246700e+06 | 1.246700e+06 | 1.246700e+06 | 1.246700e+06 | 1.246700e+06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 261 | Kosovo | XKX | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 262 | Yemen, Rep. | YEM | 527970.0 | 527970.0 | 527970.0 | 527970.0 | 527970.0 | 527970.0 | 527970.0 | 527970.0 | ... | 5.279700e+05 | 527970.00 | 5.279700e+05 | 5.279700e+05 | 5.279700e+05 | 5.279700e+05 | 5.279700e+05 | 5.279700e+05 | 5.279700e+05 | 5.279700e+05 |
| 263 | South Africa | ZAF | 1213090.0 | 1213090.0 | 1213090.0 | 1213090.0 | 1213090.0 | 1213090.0 | 1213090.0 | 1213090.0 | ... | 1.213090e+06 | 1213090.00 | 1.213090e+06 | 1.213090e+06 | 1.213090e+06 | 1.213090e+06 | 1.213090e+06 | 1.213090e+06 | 1.213090e+06 | 1.213090e+06 |
| 264 | Zambia | ZMB | 743390.0 | 743390.0 | 743390.0 | 743390.0 | 743390.0 | 743390.0 | 743390.0 | 743390.0 | ... | 7.433900e+05 | 743390.00 | 7.433900e+05 | 7.433900e+05 | 7.433900e+05 | 7.433900e+05 | 7.433900e+05 | 7.433900e+05 | 7.433900e+05 | 7.433900e+05 |
| 265 | Zimbabwe | ZWE | 386850.0 | 386850.0 | 386850.0 | 386850.0 | 386850.0 | 386850.0 | 386850.0 | 386850.0 | ... | 3.868500e+05 | 386850.00 | 3.868500e+05 | 3.868500e+05 | 3.868500e+05 | 3.868500e+05 | 3.868500e+05 | 3.868500e+05 | 3.868500e+05 | 3.868500e+05 |
266 rows × 61 columns
Now, I will further clean the data to match what I want to analyse by selecting and merging datasets. First, I define two functions that return a country's alpha-2 and alpha-3 codes through a fuzzy search of its name; I use search_fuzzy because some countries go by several names. The alpha-2 and alpha-3 codes are universal identifiers, unlike country names, which makes them very useful for merging with other data and for plotting.
#Function to find alpha 2 code of country through its name
def findCountry2(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_2
    except LookupError:
        #search_fuzzy raises LookupError when no country matches
        return None

#Function to find alpha 3 code of country through its name
def findCountry3(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_3
    except LookupError:
        return None
For the datasets below, I add up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to fill in missing 'Total departures' values. I use millions as the unit because it is easier to display on the folium choropleth map legend. To replace missing data I forward fill and then backward fill across the years; this patches gaps for countries with partial data without inventing values for countries that have no data at all. I analyse 1995 to 2019 only, since the drastic drop in 2020 cannot reasonably be filled. Finally, I add a 'sum' column totalling each country's data over the years, and attach each country's alpha-2 and alpha-3 codes before merging in the region, latitude, and longitude.
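The fill strategy described above can be illustrated on a toy frame: a country with partial data gets its gaps filled from neighbouring years, while a country with no data at all stays untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {1995: [np.nan, np.nan], 1996: [3.0, np.nan],
     1997: [np.nan, np.nan], 1998: [5.0, np.nan]},
    index=['Partial', 'Empty'],
)
# Forward fill then backward fill across the year columns
filled = df.ffill(axis=1).bfill(axis=1)
# 'Partial' becomes [3.0, 3.0, 3.0, 5.0]; 'Empty' stays all-NaN
```

Because bfill only copies existing values backwards, a row that was entirely NaN remains entirely NaN, so countries without any records are not given fabricated figures.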
#Adding up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to get 'Total departures'
outbound_2 = outbound_departures.reset_index()
outbound_2 = outbound_2[outbound_2['Indicators'] != 'Total departures']
outbound_2 = outbound_2.groupby('Countries').sum()[np.arange(1995,2020)].reset_index()
outbound_2 = outbound_2.set_index('Countries')
outbound_2
outbound = outbound_departures.reset_index()
#Selecting 'Total departures'
outbound = outbound[outbound['Indicators'] == 'Total departures']
outbound = outbound.drop(['Indicators', 'Units', 2020], axis = 1)
outbound = outbound.set_index('Countries')
#Replacing NaN data of outbound with values from outbound_2
outbound = outbound.combine_first(outbound_2).reset_index()
#Changing the units from thousands to millions
outbound[outbound.select_dtypes(include = ['number']).columns] /= 1000.0
outbound.insert(0, 'Units', 'Millions')
#Setting all 0 to NaN
outbound = outbound.replace({0: np.nan})
#Forward filling before backward filling
outbound[np.arange(1995,2020)] = outbound[np.arange(1995,2020)].ffill(axis = 1).bfill(axis = 1)
#Summing the data
outbound['sum'] = outbound[np.arange(1995,2020)].sum(axis=1)
#Getting alpha 2 code
outbound['country_alpha_2'] = outbound.apply(lambda row: findCountry2(row.Countries) , axis = 1)
#Getting alpha 3 code
outbound['country_alpha_3'] = outbound.apply(lambda row: findCountry3(row.Countries) , axis = 1)
#Merging with countries to get latitude and longitude of each country
outbound = pd.merge(outbound, countries, left_on='country_alpha_2', right_on='country', how='left')
#Merging with region to get region of each country
outbound = pd.merge(outbound, region[['alpha-2', 'region']], left_on='country_alpha_2', right_on='alpha-2', how='left')
#Dropping redundant data
outbound = outbound.drop(['alpha-2', 'country', 'Countries'],axis=1)
#Rearranging the columns
cols = outbound.columns.tolist()
cols = [cols[-2]] + cols[-6:-4] + [cols[-1]] + cols[-4:-2] + cols[:-6]
outbound = outbound[cols]
#Dropping null data
outbound = outbound[outbound['name'].notna()]
#Renaming the column
outbound = outbound.rename(columns={'Units':'units'})
#Sorting through sum
outbound = outbound.sort_values(by='sum', ascending = False)
#Dropping duplicate data
outbound = outbound.drop_duplicates(subset = 'name')
#Resetting index
outbound = outbound.reset_index(drop = True)
#Setting the alpha 2 code and alpha 3 code specifically for Namibia
outbound.loc[outbound['name'] == 'Namibia', 'country_alpha_2'] = 'NA'
outbound.loc[outbound['name'] == 'Namibia', 'country_alpha_3'] = 'NAM'
outbound
| name | country_alpha_2 | country_alpha_3 | region | latitude | longitude | units | 1995 | 1996 | 1997 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | sum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United States of America | US | USA | Americas | 37.090240 | -95.712891 | Millions | 74.031 | 76.803 | 78.481 | ... | 114.089 | 116.329 | 118.968 | 121.699 | 130.364 | 141.526 | 148.045 | 158.4454 | 170.9301 | 2781.9205 |
| 1 | Mexico | MX | MEX | Americas | 23.634501 | -102.552784 | Millions | 103.161 | 103.442 | 107.242 | ... | 88.113 | 87.332 | 90.787 | 90.982 | 94.988 | 97.372 | 94.274 | 86.2800 | 82.7520 | 2636.9450 |
| 2 | Germany | DE | DEU | Europe | 51.165691 | 10.451526 | Millions | 55.800 | 55.800 | 55.800 | ... | 84.692 | 82.729 | 87.459 | 83.008 | 83.737 | 90.966 | 92.402 | 108.5420 | 99.5330 | 2048.5650 |
| 3 | Namibia | NA | NAM | Africa | -22.957640 | 18.490410 | Millions | 47.594 | 47.594 | 47.594 | ... | 84.816 | 85.276 | 84.414 | 84.519 | 89.082 | 91.758 | 91.304 | 92.2140 | 94.7150 | 1833.0990 |
| 4 | United Kingdom | GB | GBR | Europe | 55.378051 | -3.435973 | Millions | 41.345 | 42.050 | 45.957 | ... | 67.493 | 66.858 | 68.959 | 72.204 | 77.619 | 81.757 | 87.242 | 90.5710 | 93.0860 | 1642.6580 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | Jamaica | JM | JAM | Americas | 18.109581 | -77.297508 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 |
| 201 | Iraq | IQ | IRQ | Asia | 33.223191 | 43.679291 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 |
| 202 | Haiti | HT | HTI | Americas | 18.971187 | -72.285215 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 |
| 203 | Guyana | GY | GUY | Americas | 4.860416 | -58.930180 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 |
| 204 | Lebanon | LB | LBN | Asia | 33.854721 | 35.862285 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0000 |
205 rows × 33 columns
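The `combine_first` call used above prefers the reported totals and only falls back to the summed indicator rows where the total is missing; a minimal sketch of that behaviour:

```python
import numpy as np
import pandas as pd

reported = pd.Series([10.0, np.nan], index=['A', 'B'])  # 'Total departures'
summed = pd.Series([9.0, 7.0], index=['A', 'B'])        # tourists + excursionists
total = reported.combine_first(summed)
# 'A' keeps its reported 10.0; 'B' is filled with the 7.0 fallback
```

This is why countries that report an official total are unaffected by the summing step, while countries that only report the component indicators still get a usable total.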
#Adding up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to get 'Total arrivals'
inbound_2 = inbound_arrivals.reset_index()
inbound_2 = inbound_2[inbound_2['Indicators'] != 'Total arrivals']
inbound_2 = inbound_2.groupby('Countries').sum()[np.arange(1995,2020)].reset_index()
inbound_2 = inbound_2.set_index('Countries')
inbound_2
inbound = inbound_arrivals.reset_index()
#Selecting 'Total arrivals'
inbound = inbound[inbound['Indicators'] == 'Total arrivals']
inbound = inbound.drop(['Indicators', 'Units', 2020], axis = 1)
inbound = inbound.set_index('Countries')
#Replacing NaN data of inbound with values from inbound_2
inbound = inbound.combine_first(inbound_2).reset_index()
#Changing the units from thousands to millions
inbound[inbound.select_dtypes(include = ['number']).columns] /= 1000.0
inbound.insert(0, 'Units', 'Millions')
#Setting all 0 to NaN
inbound = inbound.replace({0: np.nan})
#Forward filling before backward filling
inbound[np.arange(1995,2020)] = inbound[np.arange(1995,2020)].ffill(axis = 1).bfill(axis = 1)
#Summing the data
inbound['sum'] = inbound[np.arange(1995,2020)].sum(axis=1)
#Getting alpha 2 code
inbound['country_alpha_2'] = inbound.apply(lambda row: findCountry2(row.Countries) , axis = 1)
#Getting alpha 3 code
inbound['country_alpha_3'] = inbound.apply(lambda row: findCountry3(row.Countries) , axis = 1)
#Merging with countries to get latitude and longitude of each country
inbound = pd.merge(inbound, countries, left_on='country_alpha_2', right_on='country', how='left')
#Merging with region to get region of each country
inbound = pd.merge(inbound, region[['alpha-2', 'region']], left_on='country_alpha_2', right_on='alpha-2', how='left')
#Dropping redundant data
inbound = inbound.drop(['alpha-2', 'country', 'Countries'],axis=1)
#Rearranging the columns: name, country codes, region, coordinates, then units, the yearly data, and sum
cols = inbound.columns.tolist()
cols = [cols[-2]] + cols[-6:-4] + [cols[-1]] + cols[-4:-2] + cols[:-6]
inbound = inbound[cols]
#Dropping null data
inbound = inbound[inbound['name'].notna()]
#Renaming the column
inbound = inbound.rename(columns={'Units':'units'})
#Sorting by sum
inbound = inbound.sort_values(by='sum', ascending = False)
#Dropping duplicate data
inbound = inbound.drop_duplicates(subset = 'name')
#Resetting index
inbound = inbound.reset_index(drop = True)
#Setting the alpha 2 code and alpha 3 code specifically for Namibia
inbound.loc[inbound['name'] == 'Namibia', 'country_alpha_2'] = 'NA'
inbound.loc[inbound['name'] == 'Namibia', 'country_alpha_3'] = 'NAM'
inbound
| name | country_alpha_2 | country_alpha_3 | region | latitude | longitude | units | 1995 | 1996 | 1997 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | sum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | France | FR | FRA | Europe | 46.227638 | 2.213749 | Millions | 60.033 | 148.263 | 157.551 | ... | 196.595000 | 197.522000 | 204.410000 | 206.599000 | 203.302000 | 203.042000 | 207.274000 | 211.998000 | 217.877 | 4001.252000 |
| 1 | United States of America | US | USA | Americas | 37.090240 | -95.712891 | Millions | 79.732 | 82.756 | 82.525 | ... | 147.271416 | 171.629897 | 179.309907 | 178.311354 | 176.864526 | 175.261488 | 174.291746 | 169.324918 | 165.478 | 3205.206252 |
| 2 | China | CN | CHN | Asia | 35.861660 | 104.195397 | Millions | 46.387 | 51.128 | 57.588 | ... | 135.423000 | 132.405000 | 129.078000 | 128.499000 | 133.820000 | 141.774000 | 153.260000 | 158.606000 | 162.538 | 2805.217000 |
| 3 | Mexico | MX | MEX | Americas | 23.634501 | -102.552784 | Millions | 85.446 | 90.394 | 92.915 | ... | 75.732000 | 76.749000 | 78.100000 | 81.042000 | 87.129000 | 94.853000 | 99.349000 | 96.497000 | 97.406 | 2306.193000 |
| 4 | Spain | ES | ESP | Europe | 40.463667 | -3.749220 | Millions | 52.460 | 55.077 | 62.415 | ... | 99.187000 | 98.128000 | 103.231000 | 107.144000 | 109.834000 | 115.561000 | 121.717000 | 124.456000 | 126.170 | 2284.187000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | Somalia | SO | SOM | Africa | 5.152149 | 46.199616 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 |
| 201 | Equatorial Guinea | GQ | GNQ | Africa | 1.650801 | 10.267895 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 |
| 202 | Nauru | NR | NRU | Oceania | -0.522778 | 166.931503 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 |
| 203 | Liberia | LR | LBR | Africa | 6.428055 | -9.429499 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 |
| 204 | Afghanistan | AF | AFG | Asia | 33.939110 | 67.709953 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 |
205 rows × 33 columns
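The repair pattern above (also used for domestic tourism below) keeps a country's reported 'Total arrivals' where it exists and patches the gaps from the summed component indicators via `combine_first`. A minimal sketch on toy data (country names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Reported totals, entirely missing for country 'B'
totals = pd.DataFrame({1995: [10.0, np.nan], 1996: [12.0, np.nan]},
                      index=pd.Index(['A', 'B'], name='Countries'))
# Totals rebuilt by summing the component indicators
components = pd.DataFrame({1995: [9.0, 5.0], 1996: [11.0, 6.0]},
                          index=pd.Index(['A', 'B'], name='Countries'))

# combine_first keeps the reported value where present,
# falling back to the component sum where it is NaN
patched = totals.combine_first(components)
```

Note that `combine_first` aligns on the index, which is why both frames are indexed by 'Countries' before combining.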
#Adding up 'Overnights visitors (tourists)' and 'Same-day visitors (excursionists)' to get 'Total trips'
domestic_2 = domestic_trips.reset_index()
domestic_2 = domestic_2[domestic_2['Indicators'] != 'Total trips']
domestic_2 = domestic_2.groupby('Countries').sum()[np.arange(1995,2020)].reset_index()
domestic_2 = domestic_2.set_index('Countries')
domestic_2
domestic = domestic_trips.reset_index()
#Selecting 'Total trips'
domestic = domestic[domestic['Indicators'] == 'Total trips']
domestic = domestic.drop(['Indicators', 'Units', 2020], axis = 1)
domestic = domestic.set_index('Countries')
#Replacing NaN data of domestic with values from domestic_2
domestic = domestic.combine_first(domestic_2).reset_index()
#Changing the units from thousands to millions
domestic[domestic.select_dtypes(include = ['number']).columns] /= 1000.0
domestic.insert(0, 'Units', 'Millions')
#Setting all 0 to NaN
domestic = domestic.replace({0: np.nan})
#Forward filling before backward filling
domestic[np.arange(1995,2020)] = domestic[np.arange(1995,2020)].ffill(axis = 1).bfill(axis = 1)
#Summing the data
domestic['sum'] = domestic[np.arange(1995,2020)].sum(axis=1)
#Getting alpha 2 code
domestic['country_alpha_2'] = domestic.apply(lambda row: findCountry2(row.Countries) , axis = 1)
#Getting alpha 3 code
domestic['country_alpha_3'] = domestic.apply(lambda row: findCountry3(row.Countries) , axis = 1)
#Merging with countries to get latitude and longitude of each country
domestic = pd.merge(domestic, countries, left_on='country_alpha_2', right_on='country', how='left')
#Merging with region to get region of each country
domestic = pd.merge(domestic, region[['alpha-2', 'region']], left_on='country_alpha_2', right_on='alpha-2', how='left')
#Dropping redundant data
domestic = domestic.drop(['alpha-2', 'country', 'Countries'],axis=1)
#Rearranging the columns: name, country codes, region, coordinates, then units, the yearly data, and sum
cols = domestic.columns.tolist()
cols = [cols[-2]] + cols[-6:-4] + [cols[-1]] + cols[-4:-2] + cols[:-6]
domestic = domestic[cols]
#Dropping null data
domestic = domestic[domestic['name'].notna()]
#Renaming the column
domestic = domestic.rename(columns={'Units':'units'})
#Sorting by sum
domestic = domestic.sort_values(by='sum', ascending = False)
#Dropping duplicate data
domestic = domestic.drop_duplicates(subset = 'name')
#Resetting index
domestic = domestic.reset_index(drop = True)
#Setting the alpha 2 code and alpha 3 code specifically for Namibia
domestic.loc[domestic['name'] == 'Namibia', 'country_alpha_2'] = 'NA'
domestic.loc[domestic['name'] == 'Namibia', 'country_alpha_3'] = 'NAM'
domestic
| name | country_alpha_2 | country_alpha_3 | region | latitude | longitude | units | 1995 | 1996 | 1997 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | sum | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | CN | CHN | Asia | 35.861660 | 104.195397 | Millions | 629.000 | 640.000 | 644.000 | ... | 2641.000 | 2957.00 | 3262.000 | 3611.000 | 3990.000 | 4435.000 | 5010.000 | 5539.000 | 6005.852 | 55088.852 |
| 1 | United States of America | US | USA | Americas | 37.090240 | -95.712891 | Millions | 2004.500 | 2004.500 | 2004.500 | ... | 1998.500 | 2030.30 | 2059.600 | 2109.300 | 2178.700 | 2206.500 | 2248.700 | 2291.100 | 2326.623 | 51336.523 |
| 2 | India | IN | IND | Asia | 20.593684 | 78.962880 | Millions | 136.644 | 140.120 | 159.877 | ... | 864.533 | 1045.05 | 1142.529 | 1282.802 | 1431.974 | 1615.389 | 1657.546 | 1853.788 | 2321.983 | 18773.016 |
| 3 | United Kingdom | GB | GBR | Europe | 55.378051 | -3.435973 | Millions | 126.010 | 126.010 | 126.010 | ... | 1668.640 | 1836.02 | 1710.905 | 1698.942 | 1649.626 | 1953.655 | 1914.076 | 1821.956 | 1776.080 | 18039.484 |
| 4 | Japan | JP | JPN | Asia | 36.204824 | 138.252924 | Millions | 734.558 | 734.558 | 734.558 | ... | 612.525 | 612.75 | 630.950 | 595.221 | 604.715 | 641.079 | 647.510 | 561.779 | 587.103 | 17111.936 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | Iraq | IQ | IRQ | Asia | 33.223191 | 43.679291 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000 |
| 201 | Jamaica | JM | JAM | Americas | 18.109581 | -77.297508 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000 |
| 202 | Kenya | KE | KEN | Africa | -0.023559 | 37.906193 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000 |
| 203 | Kiribati | KI | KIR | Oceania | -3.370417 | -168.734039 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000 |
| 204 | Lebanon | LB | LBN | Asia | 33.854721 | 35.862285 | Millions | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000 |
205 rows × 33 columns
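Both cleaning cells fill gaps in the yearly series by forward filling along each row and then backward filling, so interior and trailing gaps take the last known year's value and any leading gaps take the first known year's value. A toy illustration with made-up numbers:

```python
import numpy as np
import pandas as pd

years = [1995, 1996, 1997, 1998]
# One country's yearly series with a leading and an interior gap
df = pd.DataFrame([[np.nan, 2.0, np.nan, 4.0]], columns=years)

# ffill copies 2.0 forward into 1997;
# bfill then copies 2.0 backward into 1995
filled = df[years].ffill(axis=1).bfill(axis=1)
```

This imputation assumes a country's tourism figures change slowly year to year, which is why forward filling is preferred over backward filling where both are possible.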
I have chosen to mainly use folium and plotly to display the data, as these libraries produce interactive plots, which I believe help viewers understand the data better since they can explore it themselves.
I will be answering this through maps and bar graphs. The process is similar for all three subquestions: I sum up each country's data from 1995 to 2019.
Firstly, I will colour the countries on the world map, with each country's colour representing its total. I chose a red-blue colour scale to highlight the countries with the highest and lowest outbound tourism. A map lets viewers also get a sense of each country's location and size, and makes comparison between countries easier.
#Creating map
outbound_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
data = outbound,
columns = ['country_alpha_3', 'sum'],
key_on = 'feature.properties.A3',
fill_color = 'RdBu',
fill_opacity = 0.8,
line_opacity = 0.2,
legend_name = 'Total Number of Outbound Tourism from 1995 to 2019 in Millions'
).add_to(outbound_map)
outbound_map
Next, I added markers on top of the map. Clicking on a marker shows the country's name and its total outbound tourism from 1995 to 2019.
#Adding markers
for i in range(0,len(outbound)):
folium.Marker(
location = [outbound.iloc[i]['latitude'], outbound.iloc[i]['longitude']],
popup = outbound.iloc[i]['name'] + '\n' + str(int(outbound.iloc[i]['sum'] * 1000000))
).add_to(outbound_map)
outbound_map
I will now be plotting the bar graphs. I first consolidated the countries and their respective total number of outbound tourism from 1995 to 2019.
outbound_sum = outbound.copy()
#Setting name as index
outbound_sum = outbound_sum.set_index('name')
#Dropping redundant data
outbound_sum = outbound_sum[['sum']]
#Sorting data by sum
outbound_sum = outbound_sum.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
outbound_sum = outbound_sum[outbound_sum['sum'] > 0].T
#Multiplying by a million
outbound_sum *= 1000000
outbound_sum
| name | United States of America | Mexico | Germany | Namibia | United Kingdom | China | Italy | Poland | Canada | Russia | ... | Central African Republic | Vanuatu | Cook Islands | Nigeria | Tajikistan | São Tomé and Príncipe | Palau | Angola | Tuvalu | Niue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sum | 2.781920e+09 | 2.636945e+09 | 2.048565e+09 | 1.833099e+09 | 1.642658e+09 | 1.456823e+09 | 1.196624e+09 | 1.144544e+09 | 952199900.0 | 756519000.0 | ... | 485000.0 | 456400.0 | 279300.0 | 250000.0 | 236600.0 | 235400.0 | 225000.0 | 75000.0 | 60500.0 | 33200.0 |
1 rows × 132 columns
Here is a vertical bar graph. The top 15 countries are displayed by default, and viewers can adjust the plot to look at the values for other countries.
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
outbound_vbar = px.bar(outbound_sum.T, title = "Countries' Total Number of Outbound Tourism from 1995 to 2019", labels = {'value': 'Total Number of Outbound Tourism', 'name': 'Country'}).for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Outbound Tourism'}[t.name])).update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5]).update_yaxes(fixedrange = False)
outbound_vbar
Similarly, here is a horizontal bar graph. I prefer this display as it also acts as a ranking.
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
outbound_hbar = px.bar(outbound_sum.T.sort_values(by = 'sum', ascending = True), title = "Countries' Total Number of Outbound Tourism from 1995 to 2019", labels = {'value': 'Total Number of Outbound Tourism', 'name': 'Country'}, orientation='h').for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Outbound Tourism'}[t.name])).update_layout(yaxis_range=[len(outbound_sum.columns)-15.5, len(outbound_sum.columns)-0.5])
outbound_hbar
The process is similar to that of outbound tourism.
#Creating map
inbound_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
data = inbound,
columns = ['country_alpha_3', 'sum'],
key_on = 'feature.properties.A3',
fill_color = 'RdBu',
fill_opacity = 0.8,
line_opacity = 0.2,
legend_name = 'Total Number of Inbound Tourism from 1995 to 2019 in Millions'
).add_to(inbound_map)
inbound_map
#Adding markers
for i in range(0,len(inbound)):
folium.Marker(
location = [inbound.iloc[i]['latitude'], inbound.iloc[i]['longitude']],
popup = inbound.iloc[i]['name'] + '\n' + str(int(inbound.iloc[i]['sum'] * 1000000))
).add_to(inbound_map)
inbound_map
inbound_sum = inbound.copy()
#Setting name as index
inbound_sum = inbound_sum.set_index('name')
#Dropping redundant data
inbound_sum = inbound_sum[['sum']]
#Sorting data by sum
inbound_sum = inbound_sum.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
inbound_sum = inbound_sum[inbound_sum['sum'] > 0].T
#Multiplying by a million
inbound_sum *= 1000000
inbound_sum
| name | France | United States of America | China | Mexico | Spain | Poland | Italy | Croatia | Hungary | Canada | ... | Micronesia | Comoros | Guinea Bissau | Kiribati | Solomon Islands | São Tomé and Príncipe | Montserrat | Marshall Islands | Niue | Tuvalu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sum | 4.001252e+09 | 3.205206e+09 | 2.805217e+09 | 2.306193e+09 | 2.284187e+09 | 1.804662e+09 | 1.762915e+09 | 1.101202e+09 | 1.021397e+09 | 876738000.0 | ... | 630200.0 | 586900.0 | 556100.0 | 555100.0 | 407400.0 | 351500.0 | 303305.0 | 152400.0 | 119100.0 | 38700.0 |
1 rows × 200 columns
At first, I used matplotlib and seaborn, but preferred plotly in the end.
#Setting figure size and resolution of figure
plt.figure(figsize=(15,5), dpi=300)
#Plotting vertical bar graph
inbound_sum_plot = sns.barplot(data = inbound_sum.T.head(15).T, color = 'skyblue')
#Rotating the x-axis labels so that they do not overlap
inbound_sum_plot.set_xticklabels(inbound_sum_plot.get_xticklabels(), rotation=45, ha="right")
#Naming x-axis
inbound_sum_plot.set_xlabel('Country')
#Naming y-axis
inbound_sum_plot.set_ylabel('Total Number of Inbound Tourism from 1995 to 2019')
#Naming the figure
inbound_sum_plot.set_title('Top 15 Countries with Largest Total Number of Inbound Tourism from 1995 to 2019')
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
inbound_vbar = px.bar(inbound_sum.T, title = "Countries' Total Number of Inbound Tourism from 1995 to 2019", labels = {'value': 'Total Number of Inbound Tourism', 'name': 'Country'}).for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Inbound Tourism'}[t.name])).update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5]).update_yaxes(fixedrange = False)
inbound_vbar
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
inbound_hbar = px.bar(inbound_sum.T.sort_values(by = 'sum', ascending = True), title = "Countries' Total Number of Inbound Tourism from 1995 to 2019", labels = {'value': 'Total Number of Inbound Tourism', 'name': 'Country'}, orientation='h').for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Inbound Tourism'}[t.name])).update_layout(yaxis_range=[len(inbound_sum.columns)-15.5, len(inbound_sum.columns)-0.5])
inbound_hbar
Finally, the process for domestic tourism is similar to that for outbound and inbound tourism.
#Creating map
domestic_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
data = domestic,
columns = ['country_alpha_3', 'sum'],
key_on = 'feature.properties.A3',
fill_color = 'RdBu',
fill_opacity = 0.8,
line_opacity = 0.2,
legend_name = 'Total Number of Domestic Tourism from 1995 to 2019 in Millions'
).add_to(domestic_map)
domestic_map
#Adding markers
for i in range(0,len(domestic)):
folium.Marker(
location = [domestic.iloc[i]['latitude'], domestic.iloc[i]['longitude']],
popup = domestic.iloc[i]['name'] + '\n' + str(int(domestic.iloc[i]['sum'] * 1000000))
).add_to(domestic_map)
domestic_map
domestic_sum = domestic.copy()
#Setting name as index
domestic_sum = domestic_sum.set_index('name')
#Dropping redundant data
domestic_sum = domestic_sum[['sum']]
#Sorting data by sum
domestic_sum = domestic_sum.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
domestic_sum = domestic_sum[domestic_sum['sum'] > 0].T
#Multiplying by a million
domestic_sum *= 1000000
domestic_sum
| name | China | United States of America | India | United Kingdom | Japan | Spain | Canada | Indonesia | France | Australia | ... | Trinidad and Tobago | Armenia | Senegal | Malta | Swaziland | Luxembourg | Tajikistan | Madagascar | Moldova | Mali |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sum | 5.508885e+10 | 5.133652e+10 | 1.877302e+10 | 1.803948e+10 | 1.711194e+10 | 8.054441e+09 | 6.169413e+09 | 6.123024e+09 | 6.109938e+09 | 6.092891e+09 | ... | 22442000.0 | 15368000.0 | 8685000.0 | 5268000.0 | 4909000.0 | 1892000.0 | 1666000.0 | 1133000.0 | 942700.0 | 908500.0 |
1 rows × 84 columns
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
domestic_vbar = px.bar(domestic_sum.T, title = "Countries' Total Number of Domestic Tourism from 1995 to 2019", labels = {'value': 'Total Number of Domestic Tourism', 'name': 'Country'}).for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Domestic Tourism'}[t.name])).update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5]).update_yaxes(fixedrange = False)
domestic_vbar
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
domestic_hbar = px.bar(domestic_sum.T.sort_values(by = 'sum', ascending = True), title = "Countries' Total Number of Domestic Tourism from 1995 to 2019", labels = {'value': 'Total Number of Domestic Tourism', 'name': 'Country'}, orientation='h').for_each_trace(lambda t: t.update(name = {'sum': 'Total Number of Domestic Tourism'}[t.name])).update_layout(yaxis_range=[len(domestic_sum.columns)-15.5, len(domestic_sum.columns)-0.5])
domestic_hbar
In my opinion, a country's tourism popularity is best measured by its number of arrivals. Hence, I chose to use the sum of inbound tourism and domestic tourism.
First, I will be merging the inbound tourism and domestic tourism.
total_arrivals = inbound.copy()
#Dropping redundant data
total_arrivals = total_arrivals[['name', 'country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum']]
#Renaming column
total_arrivals = total_arrivals.rename(columns = {'sum': 'inbound'})
#Merging
total_arrivals = pd.merge(total_arrivals, domestic[['name', 'sum']], on = 'name')
#Renaming column
total_arrivals = total_arrivals.rename(columns = {'sum': 'domestic'})
#Summing the total number of arrivals
total_arrivals['total_arrivals'] = total_arrivals[['inbound', 'domestic']].sum(axis = 1)
#Sorting the data by total number of arrivals
total_arrivals = total_arrivals.sort_values(by = 'total_arrivals', ascending = False).reset_index(drop = True)
total_arrivals
| name | country_alpha_2 | country_alpha_3 | region | latitude | longitude | units | inbound | domestic | total_arrivals | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | CN | CHN | Asia | 35.861660 | 104.195397 | Millions | 2805.217000 | 55088.852 | 57894.069000 |
| 1 | United States of America | US | USA | Americas | 37.090240 | -95.712891 | Millions | 3205.206252 | 51336.523 | 54541.729252 |
| 2 | India | IN | IND | Asia | 20.593684 | 78.962880 | Millions | 167.871000 | 18773.016 | 18940.887000 |
| 3 | United Kingdom | GB | GBR | Europe | 55.378051 | -3.435973 | Millions | 770.019000 | 18039.484 | 18809.503000 |
| 4 | Japan | JP | JPN | Asia | 36.204824 | 138.252924 | Millions | 266.116000 | 17111.936 | 17378.052000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | Somalia | SO | SOM | Africa | 5.152149 | 46.199616 | Millions | 0.000000 | 0.000 | 0.000000 |
| 201 | Equatorial Guinea | GQ | GNQ | Africa | 1.650801 | 10.267895 | Millions | 0.000000 | 0.000 | 0.000000 |
| 202 | Nauru | NR | NRU | Oceania | -0.522778 | 166.931503 | Millions | 0.000000 | 0.000 | 0.000000 |
| 203 | Liberia | LR | LBR | Africa | 6.428055 | -9.429499 | Millions | 0.000000 | 0.000 | 0.000000 |
| 204 | Afghanistan | AF | AFG | Asia | 33.939110 | 67.709953 | Millions | 0.000000 | 0.000 | 0.000000 |
205 rows × 10 columns
Similar to the first question, I will be plotting maps and bar graphs.
#Creating map
arrivals_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
data = total_arrivals,
columns = ['country_alpha_3', 'total_arrivals'],
key_on = 'feature.properties.A3',
fill_color = 'RdBu',
fill_opacity = 0.8,
line_opacity = 0.2,
legend_name = 'Total Number of Arrivals from 1995 to 2019 in Millions'
).add_to(arrivals_map)
arrivals_map
#Adding markers
for i in range(0,len(total_arrivals)):
folium.Marker(
location = [total_arrivals.iloc[i]['latitude'], total_arrivals.iloc[i]['longitude']],
popup = total_arrivals.iloc[i]['name'] + '\n' + str(int(total_arrivals.iloc[i]['total_arrivals'] * 1000000))
).add_to(arrivals_map)
arrivals_map
Next, I will be consolidating the data.
#Dropping redundant data
arrivals_sum = total_arrivals[['name', 'total_arrivals']]
#Setting name as index
arrivals_sum = arrivals_sum.set_index('name')
#Multiplying by a million
arrivals_sum *= 1000000
#Dropping data <= 0
arrivals_sum = arrivals_sum[arrivals_sum['total_arrivals'] > 0]
arrivals_sum
| total_arrivals | |
|---|---|
| name | |
| China | 5.789407e+10 |
| United States of America | 5.454173e+10 |
| India | 1.894089e+10 |
| United Kingdom | 1.880950e+10 |
| Japan | 1.737805e+10 |
| ... | ... |
| São Tomé and Príncipe | 3.515000e+05 |
| Montserrat | 3.033050e+05 |
| Marshall Islands | 1.524000e+05 |
| Niue | 1.191000e+05 |
| Tuvalu | 3.870000e+04 |
200 rows × 1 columns
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
px.bar(arrivals_sum, title = 'Total Number of Arrivals from 1995 to 2019', labels = {'value': 'Total Number of Arrivals', 'name': 'Country'}).for_each_trace(lambda t: t.update(name = {'total_arrivals': 'Total Number of Arrivals'}[t.name])).update_layout(xaxis_range=[-0.5, 14.5])
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
arrivals_bar = px.bar(arrivals_sum.sort_values(by = 'total_arrivals', ascending = True), title = 'Total Number of Arrivals from 1995 to 2019', labels = {'value': 'Total Number of Arrivals', 'name': 'Country'}, orientation='h').for_each_trace(lambda t: t.update(name = {'total_arrivals': 'Total Number of Arrivals'}[t.name])).update_layout(yaxis_range=[len(arrivals_sum.index)-15.5, len(arrivals_sum.index)-0.5])
arrivals_bar
Now, I will be plotting stacked bar graphs, as they show the individual components of each country's total.
#Dropping redundant data
arrivals_stacked = total_arrivals[['name', 'inbound', 'domestic', 'total_arrivals']]
#Setting name and total_arrivals as the index
arrivals_stacked = arrivals_stacked.set_index(['name', 'total_arrivals'])
#Stack the data and renaming the columns
arrivals_stacked = arrivals_stacked.stack().to_frame().reset_index().rename(columns = {'level_2': 'type', 0: 'sum'})
#Dropping redundant data
arrivals_stacked = arrivals_stacked[['name', 'type', 'sum', 'total_arrivals']]
#Multiplying numerical data by a million
arrivals_stacked[arrivals_stacked.select_dtypes(include = ['number']).columns] *= 1000000
#Dropping data <= 0
arrivals_stacked = arrivals_stacked[arrivals_stacked['total_arrivals'] > 0]
arrivals_stacked
| name | type | sum | total_arrivals | |
|---|---|---|---|---|
| 0 | China | inbound | 2.805217e+09 | 5.789407e+10 |
| 1 | China | domestic | 5.508885e+10 | 5.789407e+10 |
| 2 | United States of America | inbound | 3.205206e+09 | 5.454173e+10 |
| 3 | United States of America | domestic | 5.133652e+10 | 5.454173e+10 |
| 4 | India | inbound | 1.678710e+08 | 1.894089e+10 |
| ... | ... | ... | ... | ... |
| 395 | Marshall Islands | domestic | 0.000000e+00 | 1.524000e+05 |
| 396 | Niue | inbound | 1.191000e+05 | 1.191000e+05 |
| 397 | Niue | domestic | 0.000000e+00 | 1.191000e+05 |
| 398 | Tuvalu | inbound | 3.870000e+04 | 3.870000e+04 |
| 399 | Tuvalu | domestic | 0.000000e+00 | 3.870000e+04 |
400 rows × 4 columns
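The wide-to-long reshape above hinges on `stack()`: keeping name and total_arrivals in the index and stacking the inbound/domestic columns yields one row per (country, type) pair, which is the long format plotly needs to colour the stacked bars by type. A minimal sketch with invented numbers:

```python
import pandas as pd

wide = pd.DataFrame({'name': ['X'], 'inbound': [3.0],
                     'domestic': [7.0], 'total_arrivals': [10.0]})

# Stack the remaining columns into rows; the stacked level is unnamed,
# so reset_index exposes it as 'level_2' and the values as column 0
long = (wide.set_index(['name', 'total_arrivals'])
            .stack()
            .to_frame()
            .reset_index()
            .rename(columns={'level_2': 'type', 0: 'sum'}))
```

Each country now contributes one 'inbound' row and one 'domestic' row, with total_arrivals carried along for sorting.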
#Plotting horizontal stacked bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
arrivals_stacked_bar = px.bar(arrivals_stacked.sort_values(by = 'total_arrivals', ascending = True), x = 'sum', y = 'name', color = 'type', title = 'Total Number of Arrivals from 1995 to 2019', labels = {'sum': 'Total Number of Arrivals', 'name': 'Country'}, orientation = 'h').update_layout(yaxis_range=[len(arrivals_stacked.index)/2.0-15.5, len(arrivals_stacked.index)/2.0-0.5])
arrivals_stacked_bar
Here, I will be comparing the data across regions and countries. Since this is an analysis of quantitative data against multiple categorical variables, I will be using treemaps, grouped bar graphs, categorical scatterplots, side-by-side boxplots, and side-by-side violinplots.
outbound_region = outbound.copy()
#Setting name as the index
outbound_region = outbound_region.set_index('name')
#Dropping redundant data
outbound_region = outbound_region[['region', 'sum']]
#Sorting data by sum
outbound_region = outbound_region.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
outbound_region = outbound_region[outbound_region['sum'] > 0]
#Multiplying numerical data by a million
outbound_region[outbound_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
outbound_region = outbound_region.reset_index()
outbound_region
| name | region | sum | |
|---|---|---|---|
| 0 | United States of America | Americas | 2.781920e+09 |
| 1 | Mexico | Americas | 2.636945e+09 |
| 2 | Germany | Europe | 2.048565e+09 |
| 3 | Namibia | Africa | 1.833099e+09 |
| 4 | United Kingdom | Europe | 1.642658e+09 |
| ... | ... | ... | ... |
| 127 | São Tomé and Príncipe | Africa | 2.354000e+05 |
| 128 | Palau | Oceania | 2.250000e+05 |
| 129 | Angola | Africa | 7.500000e+04 |
| 130 | Tuvalu | Oceania | 6.050000e+04 |
| 131 | Niue | Oceania | 3.320000e+04 |
132 rows × 3 columns
#Plotting treemap
outbound_tree = px.treemap(outbound_region, path=[px.Constant('World'), 'region', 'name'], values='sum', color='sum', color_continuous_scale='viridis', title = 'Total Number of Outbound Tourism from 1995 to 2019', width = 800, height = 800).update_layout(coloraxis_colorbar=dict(title = 'Total Number of Outbound Tourism'))
outbound_tree
inbound_region = inbound.copy()
#Setting name as the index
inbound_region = inbound_region.set_index('name')
#Dropping redundant data
inbound_region = inbound_region[['region', 'sum']]
#Sorting data by sum
inbound_region = inbound_region.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
inbound_region = inbound_region[inbound_region['sum'] > 0]
#Multiplying numerical data by a million
inbound_region[inbound_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
inbound_region = inbound_region.reset_index()
inbound_region
| name | region | sum | |
|---|---|---|---|
| 0 | France | Europe | 4.001252e+09 |
| 1 | United States of America | Americas | 3.205206e+09 |
| 2 | China | Asia | 2.805217e+09 |
| 3 | Mexico | Americas | 2.306193e+09 |
| 4 | Spain | Europe | 2.284187e+09 |
| ... | ... | ... | ... |
| 195 | São Tomé and Príncipe | Africa | 3.515000e+05 |
| 196 | Montserrat | Americas | 3.033050e+05 |
| 197 | Marshall Islands | Oceania | 1.524000e+05 |
| 198 | Niue | Oceania | 1.191000e+05 |
| 199 | Tuvalu | Oceania | 3.870000e+04 |
200 rows × 3 columns
#Plotting treemap
inbound_tree = px.treemap(inbound_region, path=[px.Constant('World'), 'region', 'name'], values='sum', color='sum', color_continuous_scale='viridis', title = 'Total Number of Inbound Tourism from 1995 to 2019', width = 800, height = 800).update_layout(coloraxis_colorbar=dict(title = 'Total Number of Inbound Tourism'))
inbound_tree
domestic_region = domestic.copy()
#Setting name as the index
domestic_region = domestic_region.set_index('name')
#Dropping redundant data
domestic_region = domestic_region[['region', 'sum']]
#Sorting data by sum
domestic_region = domestic_region.sort_values(by = 'sum', ascending = False)
#Dropping data <= 0
domestic_region = domestic_region[domestic_region['sum'] > 0]
#Multiplying numerical data by a million
domestic_region[domestic_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
domestic_region = domestic_region.reset_index()
domestic_region
| name | region | sum | |
|---|---|---|---|
| 0 | China | Asia | 5.508885e+10 |
| 1 | United States of America | Americas | 5.133652e+10 |
| 2 | India | Asia | 1.877302e+10 |
| 3 | United Kingdom | Europe | 1.803948e+10 |
| 4 | Japan | Asia | 1.711194e+10 |
| ... | ... | ... | ... |
| 79 | Luxembourg | Europe | 1.892000e+06 |
| 80 | Tajikistan | Asia | 1.666000e+06 |
| 81 | Madagascar | Africa | 1.133000e+06 |
| 82 | Moldova | Europe | 9.427000e+05 |
| 83 | Mali | Africa | 9.085000e+05 |
84 rows × 3 columns
#Plotting treemap
domestic_tree = px.treemap(domestic_region, path=[px.Constant('World'), 'region', 'name'], values='sum', color='sum', color_continuous_scale='viridis', title = 'Total Number of Domestic Tourism from 1995 to 2019', width = 800, height = 800).update_layout(coloraxis_colorbar=dict(title = 'Total Number of Domestic Tourism'))
domestic_tree
arrivals_region = total_arrivals.copy()
#Setting name as the index
arrivals_region = arrivals_region.set_index('name')
#Dropping redundant data
arrivals_region = arrivals_region[['region', 'total_arrivals']]
#Sorting data by total_arrivals
arrivals_region = arrivals_region.sort_values(by = 'total_arrivals', ascending = False)
#Dropping data <= 0
arrivals_region = arrivals_region[arrivals_region['total_arrivals'] > 0]
#Multiplying numerical data by a million
arrivals_region[arrivals_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
arrivals_region = arrivals_region.reset_index()
arrivals_region
| name | region | total_arrivals | |
|---|---|---|---|
| 0 | China | Asia | 5.789407e+10 |
| 1 | United States of America | Americas | 5.454173e+10 |
| 2 | India | Asia | 1.894089e+10 |
| 3 | United Kingdom | Europe | 1.880950e+10 |
| 4 | Japan | Asia | 1.737805e+10 |
| ... | ... | ... | ... |
| 195 | São Tomé and Príncipe | Africa | 3.515000e+05 |
| 196 | Montserrat | Americas | 3.033050e+05 |
| 197 | Marshall Islands | Oceania | 1.524000e+05 |
| 198 | Niue | Oceania | 1.191000e+05 |
| 199 | Tuvalu | Oceania | 3.870000e+04 |
200 rows × 3 columns
#Plotting treemap
arrivals_tree = px.treemap(arrivals_region, path=[px.Constant('World'), 'region', 'name'], values='total_arrivals', color='total_arrivals', color_continuous_scale='RdBu', title = 'Total Number of Arrivals from 1995 to 2019', width = 800, height = 800).update_layout(coloraxis_colorbar=dict(title = 'Total Number of Arrivals'))
arrivals_tree
First, I will merge the outbound, inbound, and domestic tourism totals into a single DataFrame.
total = outbound.copy()
#Dropping redundant data
total = total[['name', 'country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum']]
#Renaming column
total = total.rename(columns = {'sum': 'outbound'})
#Merging
total = pd.merge(total, inbound[['name', 'sum']], on = 'name')
#Renaming column
total = total.rename(columns = {'sum': 'inbound'})
#Merging
total = pd.merge(total, domestic[['name', 'sum']], on = 'name')
#Renaming column
total = total.rename(columns = {'sum': 'domestic'})
#Summing the total number of tourism
total['total'] = total[['outbound', 'inbound', 'domestic']].sum(axis = 1)
#Sorting the data by total number of tourism
total = total.sort_values(by = 'total', ascending = False).reset_index(drop = True)
total
| | name | country_alpha_2 | country_alpha_3 | region | latitude | longitude | units | outbound | inbound | domestic | total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | CN | CHN | Asia | 35.861660 | 104.195397 | Millions | 1456.8230 | 2805.217000 | 55088.852 | 59350.892000 |
| 1 | United States of America | US | USA | Americas | 37.090240 | -95.712891 | Millions | 2781.9205 | 3205.206252 | 51336.523 | 57323.649752 |
| 2 | United Kingdom | GB | GBR | Europe | 55.378051 | -3.435973 | Millions | 1642.6580 | 770.019000 | 18039.484 | 20452.161000 |
| 3 | India | IN | IND | Asia | 20.593684 | 78.962880 | Millions | 287.1590 | 167.871000 | 18773.016 | 19228.046000 |
| 4 | Japan | JP | JPN | Asia | 36.204824 | 138.252924 | Millions | 422.0640 | 266.116000 | 17111.936 | 17800.116000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | Liberia | LR | LBR | Africa | 6.428055 | -9.429499 | Millions | 0.0000 | 0.000000 | 0.000 | 0.000000 |
| 201 | Nauru | NR | NRU | Oceania | -0.522778 | 166.931503 | Millions | 0.0000 | 0.000000 | 0.000 | 0.000000 |
| 202 | Somalia | SO | SOM | Africa | 5.152149 | 46.199616 | Millions | 0.0000 | 0.000000 | 0.000 | 0.000000 |
| 203 | Afghanistan | AF | AFG | Asia | 33.939110 | 67.709953 | Millions | 0.0000 | 0.000000 | 0.000 | 0.000000 |
| 204 | Equatorial Guinea | GQ | GNQ | Africa | 1.650801 | 10.267895 | Millions | 0.0000 | 0.000000 | 0.000 | 0.000000 |
205 rows × 11 columns
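One thing worth noting: `pd.merge` defaults to an inner join, so any country missing from `inbound` or `domestic` is silently dropped from `total`. A quick sanity check (a sketch on hypothetical toy frames, not the real data) uses the `indicator=True` flag to list what an inner join would lose:

```python
import pandas as pd

# Toy frames standing in for two of the tourism tables (hypothetical data)
left = pd.DataFrame({'name': ['China', 'Japan', 'Nauru'], 'outbound': [1456.8, 422.1, 0.0]})
right = pd.DataFrame({'name': ['China', 'Japan'], 'domestic': [55088.9, 17111.9]})

# An outer merge with indicator=True flags rows present in only one frame
check = pd.merge(left, right, on='name', how='outer', indicator=True)
dropped = check[check['_merge'] != 'both']['name'].tolist()
print(dropped)  # countries an inner join would discard
```

Running this on the real `outbound`, `inbound`, and `domestic` frames would reveal whether the 205-row result lost any countries.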
total_region = total.copy()
#Setting name as the index
total_region = total_region.set_index('name')
#Dropping redundant data
total_region = total_region[['region', 'total']]
#Sorting data by total
total_region = total_region.sort_values(by = 'total', ascending = False)
#Dropping data <= 0
total_region = total_region[total_region['total'] > 0]
#Multiplying numerical data by a million
total_region[total_region.select_dtypes(include=['number']).columns] *= 1000000
#Resetting the index
total_region = total_region.reset_index()
total_region
| | name | region | total |
|---|---|---|---|
| 0 | China | Asia | 5.935089e+10 |
| 1 | United States of America | Americas | 5.732365e+10 |
| 2 | United Kingdom | Europe | 2.045216e+10 |
| 3 | India | Asia | 1.922805e+10 |
| 4 | Japan | Asia | 1.780012e+10 |
| ... | ... | ... | ... |
| 195 | Solomon Islands | Oceania | 4.074000e+05 |
| 196 | Montserrat | Americas | 3.033050e+05 |
| 197 | Marshall Islands | Oceania | 1.524000e+05 |
| 198 | Niue | Oceania | 1.523000e+05 |
| 199 | Tuvalu | Oceania | 9.920000e+04 |
200 rows × 3 columns
#Plotting treemap
total_tree = px.treemap(total_region, path=[px.Constant('World'), 'region', 'name'], values='total', color='total', color_continuous_scale='RdBu', title = 'Total Number of Tourism from 1995 to 2019', width = 800, height = 800).update_layout(coloraxis_colorbar=dict(title = 'Total Number of Tourism'))
total_tree
#Plotting grouped bar graph
px.bar(total_region, x = 'name', y = 'total', color = 'region', title = 'Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
#Plotting vertical grouped bar graph using only the top 5 countries of each region
px.bar(total_region.groupby('region').head(5), x = 'name', y = 'total', color = 'region', title = 'Top 5 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
#Plotting horizontal grouped bar graph using only the top 3 countries of each region and reversing the order
total_group = px.bar(total_region.groupby('region').head(3), x = 'total', y = 'name', color = 'region', title = 'Top 3 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'}, orientation = 'h').update_layout(yaxis = dict(autorange = 'reversed'))
total_group
#Plotting side-by-side boxplot with categorical scatterplot at the side using only the top 5 countries of each region
total_box = px.box(total_region.groupby('region').head(5), y = 'total', x = 'region', color = 'region', hover_data = total_region.columns, points = 'all', title = 'Boxplot of Top 5 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
total_box
#Plotting side-by-side violinplot
plt.figure(figsize=(15,10))
sns.violinplot(data = total_region, x = 'region', y = 'total', inner = 'quartile')
plt.show()
#Plotting side-by-side violinplot with categorical scatterplot at the side using only the top 5 countries of each region
total_violin = px.violin(total_region.groupby('region').head(5), y = 'total', x = 'region', color = 'region', box=True, points = 'all', hover_data = total_region.columns, title = 'Violinplot of Top 5 Countries with Highest Total Number of Tourism in Each Region', labels = {'total': 'Total Number of Tourism', 'name': 'Country'})
total_violin
As in the previous questions, I will sum the outbound, inbound, and domestic tourism from 1995 to 2019, and I will use similar figures.
#Creating map
total_map = folium.Map()
#Adding choropleth map
folium.Choropleth(geo_data = 'https://github.com/simonepri/geo-maps/releases/download/v0.6.0/countries-land-10km.geo.json',
data = total,
columns = ['country_alpha_3', 'total'],
key_on = 'feature.properties.A3',
fill_color = 'RdBu',
fill_opacity = 0.8,
line_opacity = 0.2,
legend_name = 'Total Number of Tourism from 1995 to 2019 in Millions'
).add_to(total_map)
total_map
#Adding markers
#Adding markers
for i in range(0, len(total)):
    folium.Marker(
        location = [total.iloc[i]['latitude'], total.iloc[i]['longitude']],
        popup = total.iloc[i]['name'] + '\n' + str(int(total.iloc[i]['total'] * 1000000))
    ).add_to(total_map)
total_map
#Dropping redundant data
tourism_total = total[['name', 'total']]
#Setting name as index
tourism_total = tourism_total.set_index('name')
#Multiplying by a million
tourism_total *= 1000000
#Dropping data <= 0
tourism_total = tourism_total[tourism_total['total'] > 0]
tourism_total
| name | total |
|---|---|
| China | 5.935089e+10 |
| United States of America | 5.732365e+10 |
| United Kingdom | 2.045216e+10 |
| India | 1.922805e+10 |
| Japan | 1.780012e+10 |
| ... | ... |
| Solomon Islands | 4.074000e+05 |
| Montserrat | 3.033050e+05 |
| Marshall Islands | 1.524000e+05 |
| Niue | 1.523000e+05 |
| Tuvalu | 9.920000e+04 |
200 rows × 1 columns
#Plotting vertical bar graph and setting the x-axis to show the top 15 countries by default
px.bar(tourism_total, title = 'Total Number of Tourism from 1995 to 2019', labels = {'value': 'Total Number of Tourism', 'name': 'Country'}).for_each_trace(lambda t: t.update(name = {'total': 'Total Number of Tourism'}[t.name])).update_layout(xaxis_range=[-0.5, 14.5])
#Plotting horizontal bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
total_bar = px.bar(tourism_total.sort_values(by = 'total', ascending = True), title = 'Total Number of Tourism from 1995 to 2019', labels = {'value': 'Total Number of Tourism', 'name': 'Country'}, orientation='h').for_each_trace(lambda t: t.update(name = {'total': 'Total Number of Tourism'}[t.name])).update_layout(yaxis_range=[len(tourism_total.index)-15.5, len(tourism_total.index)-0.5])
total_bar
#Dropping redundant data
total_stacked = total[['name', 'outbound', 'inbound', 'domestic', 'total']]
#Setting name and total as the index
total_stacked = total_stacked.set_index(['name', 'total'])
#Stacking the data and renaming the columns
total_stacked = total_stacked.stack().to_frame().reset_index().rename(columns = {'level_2': 'type', 0: 'sum'})
#Dropping redundant data
total_stacked = total_stacked[['name', 'type', 'sum', 'total']]
#Multiplying numerical data by a million
total_stacked[total_stacked.select_dtypes(include = ['number']).columns] *= 1000000
#Dropping data <= 0
total_stacked = total_stacked[total_stacked['total'] > 0]
total_stacked
| | name | type | sum | total |
|---|---|---|---|---|
| 0 | China | outbound | 1.456823e+09 | 5.935089e+10 |
| 1 | China | inbound | 2.805217e+09 | 5.935089e+10 |
| 2 | China | domestic | 5.508885e+10 | 5.935089e+10 |
| 3 | United States of America | outbound | 2.781920e+09 | 5.732365e+10 |
| 4 | United States of America | inbound | 3.205206e+09 | 5.732365e+10 |
| ... | ... | ... | ... | ... |
| 595 | Niue | inbound | 1.191000e+05 | 1.523000e+05 |
| 596 | Niue | domestic | 0.000000e+00 | 1.523000e+05 |
| 597 | Tuvalu | outbound | 6.050000e+04 | 9.920000e+04 |
| 598 | Tuvalu | inbound | 3.870000e+04 | 9.920000e+04 |
| 599 | Tuvalu | domestic | 0.000000e+00 | 9.920000e+04 |
600 rows × 4 columns
#Plotting vertical stacked bar graph and setting the x-axis to show the top 15 countries by default
px.bar(total_stacked, x = 'name', y = 'sum', color = 'type', title = 'Total Number of Tourism from 1995 to 2019', labels = {'value': 'Total Number of Tourism', 'name': 'Country'}).update_layout(xaxis_range=[-0.5, 14.5])
Here, I experimented with an x-axis rangeslider.
#Plotting vertical bar graph, setting an x-axis rangeslider, and setting the x-axis to show the top 15 countries by default
total_stacked_vbar = px.bar(total_stacked, x = 'name', y = 'sum', color = 'type', title = 'Total Number of Tourism from 1995 to 2019', labels = {'sum': 'Total Number of Tourism', 'name': 'Country'}).update_layout(xaxis_rangeslider_visible=True, xaxis_range=[-0.5, 14.5]).update_yaxes(fixedrange = False)
total_stacked_vbar
#Plotting horizontal stacked bar graph, reversing the order, and setting the y-axis to show the top 15 countries by default
total_stacked_hbar = px.bar(total_stacked.sort_values(by = 'total', ascending = True), x = 'sum', y = 'name', color = 'type', title = 'Total Number of Tourism from 1995 to 2019', labels = {'sum': 'Total Number of Tourism', 'name': 'Country'}, orientation = 'h').update_layout(yaxis_range=[len(total_stacked.index)/3.0-15.5, len(total_stacked.index)/3.0-0.5])
total_stacked_hbar
outbound_year = outbound.copy()
outbound_year = outbound_year.drop(['country_alpha_2','country_alpha_3','latitude','longitude','units'], axis = 1)
outbound_year = outbound_year.set_index(['name', 'region'])
outbound_year = outbound_year[np.arange(1995,2020)].stack().to_frame().reset_index()
outbound_year = outbound_year.rename(columns = {'level_2': 'year', 0: 'sum'})
outbound_year['sum'] *= 1000000
outbound_year
| | name | region | year | sum |
|---|---|---|---|---|
| 0 | United States of America | Americas | 1995 | 74031000.0 |
| 1 | United States of America | Americas | 1996 | 76803000.0 |
| 2 | United States of America | Americas | 1997 | 78481000.0 |
| 3 | United States of America | Americas | 1998 | 82758000.0 |
| 4 | United States of America | Americas | 1999 | 84540000.0 |
| ... | ... | ... | ... | ... |
| 3295 | Niue | Oceania | 2015 | 1600.0 |
| 3296 | Niue | Oceania | 2016 | 1600.0 |
| 3297 | Niue | Oceania | 2017 | 1600.0 |
| 3298 | Niue | Oceania | 2018 | 1600.0 |
| 3299 | Niue | Oceania | 2019 | 1600.0 |
3300 rows × 4 columns
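The set_index/stack/reset_index chain above is one way to go from wide to long form; `pd.melt` does the same reshape in a single call. A minimal sketch on a tiny hypothetical frame:

```python
import pandas as pd

# Toy wide-form frame: one column per year, as in outbound_year before reshaping
wide = pd.DataFrame({'name': ['A', 'B'], 'region': ['Asia', 'Europe'],
                     1995: [1.0, 2.0], 1996: [3.0, 4.0]})

# id_vars stay as columns; the year columns collapse into (year, sum) pairs
long = wide.melt(id_vars = ['name', 'region'], value_vars = [1995, 1996],
                 var_name = 'year', value_name = 'sum')
long['sum'] *= 1000000
print(long)
```

Either approach gives the long format that plotly express expects for its `x`, `y`, and `color` arguments.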
top_outbound = outbound.nlargest(10, 'sum')
top_outbound = top_outbound.set_index('name')
top_outbound = top_outbound.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum'],axis=1)
top_outbound.columns.name = 'year'
top_outbound = top_outbound.T
top_outbound[top_outbound.select_dtypes(include=['number']).columns] *= 1000000
top_outbound
| year | United States of America | Mexico | Germany | Namibia | United Kingdom | China | Italy | Poland | Canada | Russia |
|---|---|---|---|---|---|---|---|---|---|---|
| 1995 | 74031000.0 | 103161000.0 | 55800000.0 | 47594000.0 | 41345000.0 | 4520000.0 | 18173000.0 | 36387000.0 | 18206000.0 | 21329000.0 |
| 1996 | 76803000.0 | 103442000.0 | 55800000.0 | 47594000.0 | 42050000.0 | 5061000.0 | 18173000.0 | 44713000.0 | 18973000.0 | 12260000.0 |
| 1997 | 78481000.0 | 107242000.0 | 55800000.0 | 47594000.0 | 45957000.0 | 5324000.0 | 40196000.0 | 48610000.0 | 19111000.0 | 11182000.0 |
| 1998 | 82758000.0 | 107927000.0 | 69200000.0 | 47594000.0 | 50872000.0 | 8426000.0 | 42431000.0 | 49328000.0 | 17648000.0 | 10635000.0 |
| 1999 | 84540000.0 | 117383000.0 | 78100000.0 | 53144000.0 | 53881000.0 | 9232000.0 | 42390000.0 | 55097000.0 | 18368000.0 | 12631000.0 |
| 2000 | 87973000.0 | 127268000.0 | 80507000.0 | 58901000.0 | 56837000.0 | 10473000.0 | 44628000.0 | 56677000.0 | 19182000.0 | 18371000.0 |
| 2001 | 84755000.0 | 123732000.0 | 81551000.0 | 61096000.0 | 58281000.0 | 12133000.0 | 43611000.0 | 53122000.0 | 18359000.0 | 18030000.0 |
| 2002 | 80883000.0 | 124633000.0 | 80393000.0 | 64540000.0 | 59377000.0 | 16602000.0 | 44660000.0 | 45043000.0 | 17705000.0 | 20428000.0 |
| 2003 | 75880000.0 | 123015000.0 | 85345000.0 | 60936000.0 | 61424000.0 | 20222000.0 | 46357000.0 | 38730000.0 | 17739000.0 | 20572000.0 |
| 2004 | 79655000.0 | 128903000.0 | 84859000.0 | 68903000.0 | 64194000.0 | 28853000.0 | 40400000.0 | 37226000.0 | 19595000.0 | 24507000.0 |
| 2005 | 79215000.0 | 128392000.0 | 86622000.0 | 72300000.0 | 66494000.0 | 31026000.0 | 43407000.0 | 40841000.0 | 21099000.0 | 28416000.0 |
| 2006 | 148511000.0 | 122022000.0 | 81801000.0 | 75812000.0 | 69536000.0 | 34524000.0 | 46369000.0 | 44696000.0 | 46912000.0 | 29107000.0 |
| 2007 | 140364000.0 | 109540000.0 | 82099000.0 | 80682000.0 | 69450000.0 | 40954000.0 | 49166000.0 | 47561000.0 | 50044000.0 | 34285000.0 |
| 2008 | 136148000.0 | 107519000.0 | 86201000.0 | 81911000.0 | 69011000.0 | 45844000.0 | 54421000.0 | 50243000.0 | 51737000.0 | 36538000.0 |
| 2009 | 129954000.0 | 98228000.0 | 85547000.0 | 81958000.0 | 63513000.0 | 47656000.0 | 54839000.0 | 39270000.0 | 47481000.0 | 34276000.0 |
| 2010 | 121574000.0 | 91658000.0 | 85872000.0 | 84442000.0 | 64647000.0 | 57386000.0 | 55304000.0 | 42760000.0 | 53620000.0 | 39323000.0 |
| 2011 | 114089000.0 | 88113000.0 | 84692000.0 | 84816000.0 | 67493000.0 | 70250000.0 | 52617000.0 | 43270000.0 | 61909000.0 | 43726000.0 |
| 2012 | 116329000.0 | 87332000.0 | 82729000.0 | 85276000.0 | 66858000.0 | 83182000.0 | 53338000.0 | 48290000.0 | 65175000.0 | 47813000.0 |
| 2013 | 118968000.0 | 90787000.0 | 87459000.0 | 84414000.0 | 68959000.0 | 98185000.0 | 52633000.0 | 52580000.0 | 65780000.0 | 54069000.0 |
| 2014 | 121699000.0 | 90982000.0 | 83008000.0 | 84519000.0 | 72204000.0 | 116593000.0 | 55169000.0 | 35400000.0 | 63737000.0 | 45889000.0 |
| 2015 | 130364000.0 | 94988000.0 | 83737000.0 | 89082000.0 | 77619000.0 | 127860000.0 | 57418000.0 | 44300000.0 | 55971000.0 | 34550000.0 |
| 2016 | 141526000.0 | 97372000.0 | 90966000.0 | 91758000.0 | 81757000.0 | 135130000.0 | 57480000.0 | 44500000.0 | 52979000.0 | 31659000.0 |
| 2017 | 148045000.0 | 94274000.0 | 92402000.0 | 91304000.0 | 87242000.0 | 143035000.0 | 60042000.0 | 46700000.0 | 54955000.0 | 39629000.0 |
| 2018 | 158445400.0 | 86280000.0 | 108542000.0 | 92214000.0 | 90571000.0 | 149720000.0 | 61194600.0 | 48600000.0 | 38069000.0 | 41964000.0 |
| 2019 | 170930100.0 | 82752000.0 | 99533000.0 | 94715000.0 | 93086000.0 | 154632000.0 | 62207000.0 | 50600000.0 | 37845900.0 | 45330000.0 |
#Plotting line graph
outbound_growth = px.line(top_outbound, title = 'Number of Outbound Tourism of the Top 10 Countries over the Years', labels = {'value': 'Number of Outbound Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Countries')
outbound_growth
inbound_year = inbound.copy()
inbound_year = inbound_year.drop(['country_alpha_2','country_alpha_3','latitude','longitude','units'], axis = 1)
inbound_year = inbound_year.set_index(['name', 'region'])
inbound_year = inbound_year[np.arange(1995,2020)].stack().to_frame().reset_index()
inbound_year = inbound_year.rename(columns = {'level_2': 'year', 0: 'sum'})
inbound_year['sum'] *= 1000000
inbound_year
| | name | region | year | sum |
|---|---|---|---|---|
| 0 | France | Europe | 1995 | 60033000.0 |
| 1 | France | Europe | 1996 | 148263000.0 |
| 2 | France | Europe | 1997 | 157551000.0 |
| 3 | France | Europe | 1998 | 70109000.0 |
| 4 | France | Europe | 1999 | 73147000.0 |
| ... | ... | ... | ... | ... |
| 4995 | Tuvalu | Oceania | 2015 | 2400.0 |
| 4996 | Tuvalu | Oceania | 2016 | 2500.0 |
| 4997 | Tuvalu | Oceania | 2017 | 2500.0 |
| 4998 | Tuvalu | Oceania | 2018 | 3100.0 |
| 4999 | Tuvalu | Oceania | 2019 | 3700.0 |
5000 rows × 4 columns
#Plotting line graph
px.line(inbound_year, x = 'year', y = 'sum', color='region', line_group='name', title = 'Total Number of Inbound Tourism from 1995 to 2019', labels = {'sum': 'Number of Inbound Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Region')
#Plotting area graph
px.area(inbound_year, x = 'year', y = 'sum', color='region', line_group='name', title = 'Total Number of Inbound Tourism from 1995 to 2019', labels = {'sum': 'Number of Inbound Tourism', 'year': 'Year'}).update_layout(legend_title = 'Region')
top_inbound = inbound.nlargest(10, 'sum')
top_inbound = top_inbound.set_index('name')
top_inbound = top_inbound.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum'],axis=1)
top_inbound.columns.name = 'year'
top_inbound = top_inbound.T
top_inbound[top_inbound.select_dtypes(include=['number']).columns] *= 1000000
top_inbound
| year | France | United States of America | China | Mexico | Spain | Poland | Italy | Croatia | Hungary | Canada |
|---|---|---|---|---|---|---|---|---|---|---|
| 1995 | 60033000.0 | 79732000.0 | 46387000.0 | 85446000.0 | 52460000.0 | 82244000.0 | 55706000.0 | 16100000.0 | 39240000.0 | 41657000.0 |
| 1996 | 148263000.0 | 82756000.0 | 51128000.0 | 90394000.0 | 55077000.0 | 87439000.0 | 59805000.0 | 19085000.0 | 39833000.0 | 43256000.0 |
| 1997 | 157551000.0 | 82525000.0 | 57588000.0 | 92915000.0 | 62415000.0 | 87817000.0 | 57998000.0 | 23660000.0 | 37315000.0 | 45076000.0 |
| 1998 | 70109000.0 | 74767000.0 | 63478000.0 | 95214000.0 | 68068000.0 | 88592000.0 | 58499000.0 | 25499000.0 | 33624000.0 | 48064000.0 |
| 1999 | 73147000.0 | 75796000.0 | 72796000.0 | 99869000.0 | 72040000.0 | 89118000.0 | 59521000.0 | 29215000.0 | 28803000.0 | 49055000.0 |
| 2000 | 77190000.0 | 78343000.0 | 83444000.0 | 105673000.0 | 74580000.0 | 84515000.0 | 62702000.0 | 37226000.0 | 31141000.0 | 48638000.0 |
| 2001 | 75202000.0 | 70975000.0 | 89013000.0 | 100718000.0 | 75564000.0 | 61431000.0 | 60960000.0 | 40129000.0 | 30679000.0 | 47147000.0 |
| 2002 | 77012000.0 | 64434000.0 | 97908000.0 | 100153000.0 | 79313000.0 | 50735000.0 | 63561000.0 | 41737000.0 | 31739000.0 | 44896000.0 |
| 2003 | 75048000.0 | 62082000.0 | 91662000.0 | 92330000.0 | 82326000.0 | 52130000.0 | 63026000.0 | 42857000.0 | 31412000.0 | 38903000.0 |
| 2004 | 190282000.0 | 67606000.0 | 109038000.0 | 99250000.0 | 85981000.0 | 61918000.0 | 58480000.0 | 44974000.0 | 33934000.0 | 38845000.0 |
| 2005 | 185829000.0 | 71484000.0 | 120292000.0 | 103146000.0 | 92563000.0 | 64606000.0 | 59230000.0 | 45762000.0 | 36173000.0 | 36160000.0 |
| 2006 | 193882000.0 | 183178000.0 | 124942000.0 | 97701000.0 | 96152000.0 | 65115000.0 | 66353000.0 | 47733000.0 | 38318000.0 | 33390000.0 |
| 2007 | 193319000.0 | 175299000.0 | 131873000.0 | 93582000.0 | 98907000.0 | 66208000.0 | 70271000.0 | 52271000.0 | 39379000.0 | 30373000.0 |
| 2008 | 193571000.0 | 175703000.0 | 130027000.0 | 92948000.0 | 97670000.0 | 59935000.0 | 70719000.0 | 51336000.0 | 39554000.0 | 27370000.0 |
| 2009 | 192369000.0 | 160508000.0 | 126476000.0 | 88044000.0 | 91899000.0 | 53840000.0 | 71692000.0 | 47573000.0 | 40624000.0 | 24696000.0 |
| 2010 | 189826000.0 | 162275000.0 | 133762000.0 | 81953000.0 | 93744000.0 | 58340000.0 | 73225000.0 | 49006000.0 | 39904000.0 | 25621000.0 |
| 2011 | 196595000.0 | 147271416.0 | 135423000.0 | 75732000.0 | 99187000.0 | 60745000.0 | 75866000.0 | 49969000.0 | 41304000.0 | 25066000.0 |
| 2012 | 197522000.0 | 171629897.0 | 132405000.0 | 76749000.0 | 98128000.0 | 67390000.0 | 76293000.0 | 47185000.0 | 43565000.0 | 25318000.0 |
| 2013 | 204410000.0 | 179309907.0 | 129078000.0 | 78100000.0 | 103231000.0 | 72310000.0 | 76762000.0 | 48345000.0 | 43611000.0 | 25167000.0 |
| 2014 | 206599000.0 | 178311354.0 | 128499000.0 | 81042000.0 | 107144000.0 | 73750000.0 | 77694000.0 | 51168000.0 | 45984000.0 | 25558000.0 |
| 2015 | 203302000.0 | 176864526.0 | 133820000.0 | 87129000.0 | 109834000.0 | 77743000.0 | 81068000.0 | 55858000.0 | 48345000.0 | 27555000.0 |
| 2016 | 203042000.0 | 175261488.0 | 141774000.0 | 94853000.0 | 115561000.0 | 80476000.0 | 84925000.0 | 57587000.0 | 52890000.0 | 30142000.0 |
| 2017 | 207274000.0 | 174291746.0 | 153260000.0 | 99349000.0 | 121717000.0 | 83804000.0 | 89931000.0 | 59238000.0 | 54962000.0 | 31081000.0 |
| 2018 | 211998000.0 | 169324918.0 | 158606000.0 | 96497000.0 | 124456000.0 | 85946000.0 | 93228600.0 | 57668000.0 | 57667000.0 | 31274000.0 |
| 2019 | 217877000.0 | 165478000.0 | 162538000.0 | 97406000.0 | 126170000.0 | 88515000.0 | 95399000.0 | 60021000.0 | 61397000.0 | 32430000.0 |
I initially used seaborn but preferred plotly.
plt.figure(figsize=(15,5), dpi=300)
sns.lineplot(data = top_inbound, dashes = False)
plt.legend(bbox_to_anchor=(1.01, 1.05))
plt.title('Number of Inbound Tourism of the Top 10 Countries over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Inbound Tourism')
plt.show()
#Plotting line graph
inbound_growth = px.line(top_inbound, title = 'Number of Inbound Tourism of the Top 10 Countries over the Years', labels = {'value': 'Number of Inbound Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Countries')
inbound_growth
domestic_year = domestic.copy()
domestic_year = domestic_year.drop(['country_alpha_2','country_alpha_3','latitude','longitude','units'], axis = 1)
domestic_year = domestic_year.set_index(['name', 'region'])
domestic_year = domestic_year[np.arange(1995,2020)].stack().to_frame().reset_index()
domestic_year = domestic_year.rename(columns = {'level_2': 'year', 0: 'sum'})
domestic_year['sum'] *= 1000000
domestic_year
| | name | region | year | sum |
|---|---|---|---|---|
| 0 | China | Asia | 1995 | 629000000.0 |
| 1 | China | Asia | 1996 | 640000000.0 |
| 2 | China | Asia | 1997 | 644000000.0 |
| 3 | China | Asia | 1998 | 695000000.0 |
| 4 | China | Asia | 1999 | 719000000.0 |
| ... | ... | ... | ... | ... |
| 2095 | Mali | Africa | 2015 | 31000.0 |
| 2096 | Mali | Africa | 2016 | 26000.0 |
| 2097 | Mali | Africa | 2017 | 24000.0 |
| 2098 | Mali | Africa | 2018 | 24500.0 |
| 2099 | Mali | Africa | 2019 | 23000.0 |
2100 rows × 4 columns
top_domestic = domestic.nlargest(10, 'sum')
top_domestic = top_domestic.set_index('name')
top_domestic = top_domestic.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum'],axis=1)
top_domestic.columns.name = 'year'
top_domestic = top_domestic.T
top_domestic[top_domestic.select_dtypes(include=['number']).columns] *= 1000000
top_domestic
| year | China | United States of America | India | United Kingdom | Japan | Spain | Canada | Indonesia | France | Australia |
|---|---|---|---|---|---|---|---|---|---|---|
| 1995 | 6.290000e+08 | 2.004500e+09 | 1.366440e+08 | 1.260100e+08 | 734558000.0 | 1.154560e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 243920000.0 |
| 1996 | 6.400000e+08 | 2.004500e+09 | 1.401200e+08 | 1.260100e+08 | 734558000.0 | 1.154560e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 243920000.0 |
| 1997 | 6.440000e+08 | 2.004500e+09 | 1.598770e+08 | 1.260100e+08 | 734558000.0 | 1.154560e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 243920000.0 |
| 1998 | 6.950000e+08 | 2.004500e+09 | 1.681960e+08 | 1.260100e+08 | 734558000.0 | 1.154560e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 243920000.0 |
| 1999 | 7.190000e+08 | 2.004500e+09 | 1.906710e+08 | 1.260100e+08 | 734558000.0 | 1.154560e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 243920000.0 |
| 2000 | 7.440000e+08 | 2.004500e+09 | 2.201070e+08 | 1.260100e+08 | 734558000.0 | 3.710172e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 235235000.0 |
| 2001 | 7.840000e+08 | 2.004500e+09 | 2.364700e+08 | 1.260100e+08 | 734558000.0 | 4.052927e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 220593000.0 |
| 2002 | 8.780000e+08 | 2.004500e+09 | 2.695980e+08 | 1.260100e+08 | 734558000.0 | 4.029999e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 217472000.0 |
| 2003 | 8.700000e+08 | 2.004500e+09 | 3.090380e+08 | 1.260100e+08 | 734558000.0 | 4.326621e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 212681000.0 |
| 2004 | 1.102000e+09 | 2.004500e+09 | 3.662680e+08 | 1.260100e+08 | 734558000.0 | 4.156371e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 203869000.0 |
| 2005 | 1.212000e+09 | 2.004500e+09 | 3.920140e+08 | 1.260100e+08 | 734558000.0 | 1.570054e+08 | 208165000.0 | 200589000.0 | 210755000.0 | 200044000.0 |
| 2006 | 1.394000e+09 | 2.004500e+09 | 4.623210e+08 | 1.260100e+08 | 734558000.0 | 1.549684e+08 | 208165000.0 | 204553000.0 | 284337000.0 | 208028000.0 |
| 2007 | 1.610000e+09 | 2.004500e+09 | 5.265640e+08 | 1.260100e+08 | 734558000.0 | 3.555545e+08 | 214559000.0 | 222389000.0 | 288609000.0 | 223980000.0 |
| 2008 | 1.712000e+09 | 1.964900e+09 | 5.630340e+08 | 1.260100e+08 | 734558000.0 | 3.929012e+08 | 214498000.0 | 225041000.0 | 278950000.0 | 210754000.0 |
| 2009 | 1.902000e+09 | 1.900100e+09 | 6.688000e+08 | 1.260100e+08 | 702896000.0 | 3.652531e+08 | 227121000.0 | 229730000.0 | 278275000.0 | 215846000.0 |
| 2010 | 2.103000e+09 | 1.963700e+09 | 7.477000e+08 | 1.194340e+08 | 631596000.0 | 3.544248e+08 | 229158000.0 | 234377000.0 | 268041000.0 | 225239000.0 |
| 2011 | 2.641000e+09 | 1.998500e+09 | 8.645330e+08 | 1.668640e+09 | 612525000.0 | 3.476951e+08 | 317021000.0 | 236751000.0 | 276752000.0 | 233127000.0 |
| 2012 | 2.957000e+09 | 2.030300e+09 | 1.045050e+09 | 1.836020e+09 | 612750000.0 | 3.728110e+08 | 316254000.0 | 245290000.0 | 268673000.0 | 248377000.0 |
| 2013 | 3.262000e+09 | 2.059600e+09 | 1.142529e+09 | 1.710905e+09 | 630950000.0 | 3.984230e+08 | 320266300.0 | 250036000.0 | 265182000.0 | 240118000.0 |
| 2014 | 3.611000e+09 | 2.109300e+09 | 1.282802e+09 | 1.698942e+09 | 595221000.0 | 4.627610e+08 | 318208700.0 | 251237000.0 | 266027000.0 | 260362000.0 |
| 2015 | 3.990000e+09 | 2.178700e+09 | 1.431974e+09 | 1.649626e+09 | 604715000.0 | 3.722650e+08 | 315745700.0 | 256419000.0 | 256078000.0 | 269481000.0 |
| 2016 | 4.435000e+09 | 2.206500e+09 | 1.615389e+09 | 1.953655e+09 | 641079000.0 | 3.971340e+08 | 319315000.0 | 264338000.0 | 255498000.0 | 280325000.0 |
| 2017 | 5.010000e+09 | 2.248700e+09 | 1.657546e+09 | 1.914076e+09 | 647510000.0 | 4.483050e+08 | 325808200.0 | 270822000.0 | 276537000.0 | 291797000.0 |
| 2018 | 5.539000e+09 | 2.291100e+09 | 1.853788e+09 | 1.821956e+09 | 561779000.0 | 4.464790e+08 | 278060000.0 | 303403000.0 | 268152000.0 | 310166000.0 |
| 2019 | 6.005852e+09 | 2.326623e+09 | 2.321983e+09 | 1.776080e+09 | 587103000.0 | 4.235720e+08 | 275418000.0 | 722159000.0 | 260522000.0 | 365797000.0 |
#Plotting line graph
domestic_growth = px.line(top_domestic, title = 'Number of Domestic Tourism of the Top 10 Countries over the Years', labels = {'value': 'Number of Domestic Tourism', 'year': 'Year'}, markers = True).update_layout(legend_title = 'Countries')
domestic_growth
Here, I use the 2019 country rankings and compare them with inbound tourism in 2019.
inbound_19 = inbound[['name',2019]].dropna().sort_values(by = 2019, ascending = False).reset_index(drop = True).reset_index().drop(2019, axis = 1)
inbound_19 = inbound_19.rename(columns = {'index': 'Inbound Tourism'})
inbound_19['Inbound Tourism'] += 1
inbound_19 = pd.merge(ranking_19, inbound_19, left_on = 'Country', right_on = 'name', how = 'inner').drop('name', axis = 1)
inbound_19
| | Overall Rank | Country | Entrepreneurship | Adventure | Citizenship | Cultural Influence | Heritage | Movers | Open for Business | Power | Quality of Life | Inbound Tourism |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Switzerland | 4 | 17 | 3 | 9 | 27 | 25 | 2 | 14 | 5 | 41 |
| 1 | 2 | Japan | 1 | 39 | 17 | 6 | 10 | 5 | 22 | 7 | 13 | 21 |
| 2 | 3 | Canada | 6 | 19 | 2 | 12 | 42 | 39 | 7 | 12 | 1 | 19 |
| 3 | 4 | Germany | 2 | 57 | 12 | 11 | 20 | 34 | 21 | 4 | 10 | 14 |
| 4 | 5 | United Kingdom | 5 | 40 | 11 | 5 | 12 | 53 | 23 | 5 | 12 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68 | 71 | Jordan | 62 | 71 | 74 | 70 | 48 | 50 | 63 | 33 | 77 | 66 |
| 69 | 72 | Tunisia | 69 | 60 | 76 | 65 | 53 | 63 | 55 | 63 | 68 | 45 |
| 70 | 73 | Belarus | 56 | 61 | 50 | 71 | 67 | 66 | 76 | 35 | 67 | 40 |
| 71 | 74 | Nigeria | 67 | 74 | 77 | 63 | 76 | 57 | 58 | 46 | 74 | 68 |
| 72 | 75 | Pakistan | 68 | 77 | 78 | 79 | 71 | 56 | 72 | 22 | 73 | 132 |
73 rows × 12 columns
First, I will plot a pairplot to see the correlation between the various rankings and the inbound tourism ranking.
inbound_ranking = sns.pairplot(inbound_19, y_vars=['Inbound Tourism']).fig
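To back the visual impression with numbers, the correlation of every ranking column against the inbound tourism ranking can be computed directly and drawn as a heatmap. A sketch, using a small hypothetical frame in place of `inbound_19`:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for inbound_19 (hypothetical ranks)
df = pd.DataFrame({'Heritage': [1, 2, 3, 4, 5],
                   'Cultural Influence': [2, 1, 4, 3, 5],
                   'Inbound Tourism': [1, 3, 2, 5, 4]})

# Correlation of each ranking column with the inbound tourism ranking
corr = df.corr()['Inbound Tourism'].drop('Inbound Tourism').sort_values(ascending = False)
print(corr)

# A heatmap makes the full correlation matrix easy to scan
sns.heatmap(df.corr(), annot = True, cmap = 'viridis')
plt.show()
```

On the real `inbound_19`, sorting `corr` ranks the candidate predictors numerically rather than by eye.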
We can see that Heritage and Cultural Influence show the strongest positive correlations, so I will examine them further.
sns.regplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism')
plt.show()
sns.jointplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism')
plt.show()
sns.jointplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism', kind = 'hex')
plt.show()
sns.jointplot(data = inbound_19, x = 'Heritage', y = 'Inbound Tourism', kind = 'kde')
plt.show()
inbound_heritage_contour = px.density_contour(inbound_19, x = 'Heritage', y = 'Inbound Tourism', title = '2D Histogram Contour Plot between the Rankings for Inbound Tourism and Heritage of Countries in 2019').update_traces(contours_coloring="fill", contours_showlabels = True)
inbound_heritage_contour
inbound_heritage = px.scatter(inbound_19, x = 'Heritage', y = 'Inbound Tourism', trendline='ols', trendline_color_override='darkblue', title = 'The Correlation between the Rankings for Inbound Tourism and Heritage of Countries in 2019')
inbound_heritage
heritage_coef, heritage_value = stats.pearsonr(inbound_19['Heritage'], inbound_19['Inbound Tourism'])
heritage_coef, heritage_value
(0.6676240739225271, 1.1103724933546565e-10)
inbound_culture_contour = px.density_contour(inbound_19, x = 'Cultural Influence', y = 'Inbound Tourism', title = '2D Histogram Contour Plot between the Rankings for Inbound Tourism and Cultural Influence of Countries in 2019').update_traces(contours_coloring="fill", contours_showlabels = True)
inbound_culture_contour
inbound_culture = px.scatter(inbound_19, x = 'Cultural Influence', y = 'Inbound Tourism', trendline='ols', trendline_color_override='darkblue', title = 'The Correlation between the Rankings for Inbound Tourism and Cultural Influence of Countries in 2019')
inbound_culture
culture_coef, culture_value = stats.pearsonr(inbound_19['Cultural Influence'], inbound_19['Inbound Tourism'])
culture_coef, culture_value
(0.4942019692089541, 8.846681908836526e-06)
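Since both variables are ranks, Spearman's rank correlation is arguably a more natural choice than Pearson's here. A hedged sketch with `scipy.stats.spearmanr` on hypothetical ranks (not the real rankings):

```python
from scipy import stats

# Hypothetical rank data standing in for the inbound_19 columns
heritage = [1, 2, 3, 4, 5, 6]
inbound = [2, 1, 4, 3, 6, 5]

# Spearman correlates the ranks of the values, so it captures any
# monotonic relationship, not only a linear one
rho, p_value = stats.spearmanr(heritage, inbound)
print(rho, p_value)
```

Applied to `inbound_19['Heritage']` and `inbound_19['Inbound Tourism']`, this would complement the Pearson coefficients above.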
Since Heritage has the strongest correlation, I will use it to fit a linear regression model.
lm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(inbound_19[['Heritage']], inbound_19['Inbound Tourism'], test_size=0.2, random_state=0)
lm.fit(X_train, y_train)
yhat = lm.predict(X_test)
yhat
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
lm.score(X_train, y_train) # R^2 computed on the training data
0.44387460785374466
The model is decent but overpredicts near the middle.
Now, I will use all the variables to form a multiple linear regression model.
mlm = LinearRegression()
X_train2, X_test2, y_train2, y_test2 = train_test_split(inbound_19[['Entrepreneurship','Adventure','Citizenship','Cultural Influence','Heritage','Movers','Open for Business','Power','Quality of Life']], inbound_19['Inbound Tourism'], test_size=0.2, random_state=0)
mlm.fit(X_train2, y_train2)
yhat2 = mlm.predict(X_test2)
sns.kdeplot(y_test2, color='r', label='Actual Value')
sns.kdeplot(yhat2, color='b', label='Fitted Value')
plt.legend()
mlm.score(X_train2, y_train2) # R^2 computed on the training data
0.5532718300044949
Compared to the single-variable model, this one fits better, though it still misses a dip in the middle of the distribution.
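Note that `lm.score(X_train, y_train)` and `mlm.score(X_train2, y_train2)` report R² on the data the models were trained on, which can flatter the fit. A minimal sketch on synthetic data (the column names below are illustrative, not the project's frame) of scoring both a simple and a multiple regression on the held-out test split instead:

```python
# Sketch: comparing a simple and a multiple linear regression on held-out data.
# Synthetic data only -- these are not the project's rankings.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'Heritage': rng.uniform(0, 100, n),
    'Cultural Influence': rng.uniform(0, 100, n),
})
# Target depends on both features plus noise.
df['Inbound Tourism'] = (0.6 * df['Heritage']
                         + 0.3 * df['Cultural Influence']
                         + rng.normal(0, 10, n))

X1 = df[['Heritage']]
X2 = df[['Heritage', 'Cultural Influence']]
y = df['Inbound Tourism']

# Same random_state -> identical row partition for both feature sets.
X1_tr, X1_te, y_tr, y_te = train_test_split(X1, y, test_size=0.2, random_state=0)
X2_tr, X2_te, _, _ = train_test_split(X2, y, test_size=0.2, random_state=0)

simple = LinearRegression().fit(X1_tr, y_tr)
multi = LinearRegression().fit(X2_tr, y_tr)

# Scoring on the *test* split guards against overfitting, unlike train-only R^2.
print('simple R^2 (test):', simple.score(X1_te, y_te))
print('multi  R^2 (test):', multi.score(X2_te, y_te))
```

On this synthetic data the multiple regression scores higher on the test split, mirroring the improvement seen above.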
Next, I will look for a correlation between inbound tourism and search interest over time, which I retrieved from Google Trends.
year = countries_interest.reset_index()
year.Date = year.Date.astype('string').str[:4].astype('int')
year = year.groupby('Date').sum()
year = year.loc[np.arange(2004,2020)]
year = year.T.reset_index()
year.columns.name = ''
year[np.arange(2004,2020)] = year[np.arange(2004,2020)].div(year[np.arange(2004,2020)].max(axis = 1),axis = 0) * 100
year
| | index | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 73.684211 | 61.403509 | 61.403509 | 55.263158 | 61.403509 | 91.228070 | 100.000000 | 84.210526 | 80.701754 | 64.912281 | 59.649123 | 54.385965 | 56.140351 | 57.017544 | 64.912281 | 92.105263 |
| 1 | Åland Islands | 100.000000 | 16.358025 | 85.493827 | 24.382716 | 20.370370 | 32.716049 | 30.864198 | 37.037037 | 35.802469 | 45.370370 | 31.790123 | 45.679012 | 46.913580 | 45.679012 | 48.148148 | 49.691358 |
| 2 | Albania | 61.878453 | 61.049724 | 54.972376 | 59.944751 | 70.441989 | 77.348066 | 74.033149 | 74.033149 | 70.994475 | 70.441989 | 80.662983 | 78.729282 | 96.685083 | 88.397790 | 85.911602 | 100.000000 |
| 3 | American Samoa | 99.076923 | 84.615385 | 68.615385 | 62.769231 | 63.692308 | 100.000000 | 66.769231 | 53.846154 | 57.846154 | 45.230769 | 47.692308 | 53.230769 | 62.461538 | 50.461538 | 52.923077 | 56.923077 |
| 4 | Andorra | 100.000000 | 90.643275 | 76.842105 | 65.964912 | 56.374269 | 52.280702 | 46.432749 | 40.701754 | 39.064327 | 39.532164 | 36.959064 | 40.116959 | 37.076023 | 40.818713 | 39.766082 | 50.643275 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 222 | United Kingdom of Great Britain and Northern I... | 33.590734 | 93.050193 | 71.235521 | 76.254826 | 75.289575 | 82.625483 | 79.343629 | 67.181467 | 86.872587 | 91.505792 | 89.189189 | 86.293436 | 92.084942 | 94.594595 | 100.000000 | 88.610039 |
| 223 | United States of America | 100.000000 | 76.660342 | 56.925996 | 55.977230 | 57.495256 | 44.402277 | 39.848197 | 34.345351 | 32.637571 | 29.222011 | 29.222011 | 30.740038 | 27.514231 | 27.134725 | 43.453510 | 24.667932 |
| 224 | United States Minor Outlying Islands | 93.750000 | 41.666667 | 100.000000 | 71.875000 | 55.208333 | 68.229167 | 58.333333 | 53.645833 | 51.041667 | 47.395833 | 58.854167 | 55.729167 | 64.583333 | 60.937500 | 65.625000 | 83.333333 |
| 225 | Viet Nam | 15.515409 | 15.515409 | 14.877790 | 19.659936 | 34.112646 | 71.307120 | 96.811902 | 100.000000 | 91.710946 | 80.340064 | 86.078640 | 80.871413 | 70.350691 | 51.115834 | 47.608927 | 39.319872 |
| 226 | Zimbabwe | 77.339901 | 75.615764 | 61.330049 | 63.793103 | 85.221675 | 62.068966 | 61.330049 | 69.704433 | 70.197044 | 84.729064 | 80.788177 | 100.000000 | 96.551724 | 99.753695 | 94.581281 | 80.788177 |
227 rows × 17 columns
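The normalisation applied above, dividing each row by its own maximum and scaling to 100 so that every country peaks at exactly 100 in its strongest year, can be illustrated on a toy frame (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame(
    {2004: [50, 8], 2005: [100, 4], 2006: [25, 16]},
    index=['A', 'B'],
)
cols = [2004, 2005, 2006]
# Divide each row by its own maximum, then scale to a 0-100 range,
# so every row peaks at exactly 100 in its best year.
toy[cols] = toy[cols].div(toy[cols].max(axis=1), axis=0) * 100
print(toy)
# Row A peaks at 100 in 2005; row B peaks at 100 in 2006.
```

This makes countries with very different absolute levels directly comparable on one axis.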
inbound_interest = inbound.copy()
inbound_interest[np.arange(2004,2020)] = inbound_interest[np.arange(2004,2020)].div(inbound_interest[np.arange(2004,2020)].max(axis=1), axis=0) * 100
inbound_interest = inbound_interest.drop(['country_alpha_2', 'country_alpha_3', 'region', 'latitude', 'longitude', 'units', 'sum', 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003], axis = 1)
inbound_interest = inbound_interest.dropna()
inbound_interest
| | name | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | France | 87.334597 | 85.290783 | 88.986905 | 88.728503 | 88.844164 | 88.292477 | 87.125305 | 90.232103 | 90.657573 | 93.818990 | 94.823685 | 93.310446 | 93.191112 | 95.133493 | 97.301689 | 100.000000 |
| 1 | United States of America | 36.907271 | 39.024337 | 100.000000 | 95.698719 | 95.919270 | 87.624060 | 88.588695 | 80.397982 | 93.695693 | 97.888342 | 97.343215 | 96.553367 | 95.678241 | 95.148842 | 92.437366 | 90.337268 |
| 2 | China | 67.084620 | 74.008540 | 76.869409 | 81.133643 | 79.997908 | 77.813188 | 82.295832 | 83.317747 | 81.460951 | 79.414045 | 79.057820 | 82.331516 | 87.225141 | 94.291796 | 97.580873 | 100.000000 |
| 3 | Mexico | 96.222830 | 100.000000 | 94.721075 | 90.727706 | 90.113044 | 85.358618 | 79.453396 | 73.422139 | 74.408121 | 75.717914 | 78.570182 | 84.471526 | 91.959940 | 96.318810 | 93.553798 | 94.435073 |
| 4 | Spain | 68.146945 | 73.363716 | 76.208290 | 78.391852 | 77.411429 | 72.837442 | 74.299754 | 78.613775 | 77.774431 | 81.818974 | 84.920346 | 87.052390 | 91.591504 | 96.470635 | 98.641515 | 100.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | São Tomé and Príncipe | 30.372493 | 45.272206 | 35.243553 | 33.810888 | 41.547278 | 43.553009 | 22.922636 | 34.957020 | 34.957020 | 34.957020 | 34.957020 | 73.352436 | 82.808023 | 82.808023 | 95.702006 | 100.000000 |
| 196 | Montserrat | 78.756477 | 67.875648 | 49.222798 | 45.077720 | 43.523316 | 37.823834 | 39.896373 | 38.341969 | 51.321244 | 45.077720 | 55.440415 | 68.911917 | 69.948187 | 96.373057 | 86.528497 | 100.000000 |
| 197 | Marshall Islands | 97.826087 | 100.000000 | 63.043478 | 78.260870 | 65.217391 | 58.695652 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 53.260870 | 68.478261 | 70.652174 | 85.869565 | 73.913043 | 66.304348 |
| 198 | Niue | 22.413793 | 24.137931 | 25.862069 | 30.172414 | 40.517241 | 40.517241 | 53.448276 | 52.586207 | 43.103448 | 60.344828 | 63.793103 | 66.379310 | 76.724138 | 100.000000 | 90.517241 | 87.931034 |
| 199 | Tuvalu | 35.135135 | 29.729730 | 29.729730 | 29.729730 | 45.945946 | 43.243243 | 45.945946 | 32.432432 | 29.729730 | 35.135135 | 37.837838 | 64.864865 | 67.567568 | 67.567568 | 83.783784 | 100.000000 |
200 rows × 17 columns
Here is a small multiples plot of inbound tourism and interest over time. I did not give the figure an overall title, as it would leave a large blank space above the grid.
nrow = 40
ncol = 5
interest = plt.figure(figsize = (15,80))
gs = interest.add_gridspec(nrow, ncol, hspace = 0.5, wspace = 0)
axes = gs.subplots(sharex = True, sharey = True)
count = 0
for r in range(nrow):
    for c in range(ncol):
        series = inbound_interest.set_index('name').iloc[count]
        # Plot inbound tourism for this country
        series.plot(title = series.name, ax = axes[r,c], ylim = (0,100))
        try:
            # Overlay Google Trends interest, if this country has a row
            year.set_index('index').loc[series.name].plot(title = series.name, ax = axes[r,c], ylim = (0,100))
        except KeyError:
            pass
        count += 1
plt.show()
I randomly chose 3 countries to test for any correlation: France, Albania, and China.
inbound_interest[inbound_interest.name == 'France'][np.arange(2004,2020)].T.rename(columns = {0: 'inbound tourism'}).plot()
year[year['index'] == 'France'].drop('index', axis= 1).T.rename(columns = {70: 'interest over time'}).plot()
france_interest = pd.merge(inbound_interest[inbound_interest.name == 'France'][np.arange(2004,2020)].T.rename(columns = {0: 'inbound tourism'}), year[year['index'] == 'France'].drop('index', axis= 1).T.rename(columns = {70: 'interest over time'}), left_index=True, right_index=True)
france_interest
| | inbound tourism | interest over time |
|---|---|---|
| 2004 | 87.334597 | 100.000000 |
| 2005 | 85.290783 | 99.184783 |
| 2006 | 88.986905 | 95.923913 |
| 2007 | 88.728503 | 93.206522 |
| 2008 | 88.844164 | 87.092391 |
| 2009 | 88.292477 | 93.342391 |
| 2010 | 87.125305 | 92.119565 |
| 2011 | 90.232103 | 75.271739 |
| 2012 | 90.657573 | 66.983696 |
| 2013 | 93.818990 | 63.994565 |
| 2014 | 94.823685 | 61.413043 |
| 2015 | 93.310446 | 57.336957 |
| 2016 | 93.191112 | 51.358696 |
| 2017 | 95.133493 | 50.000000 |
| 2018 | 97.301689 | 54.483696 |
| 2019 | 100.000000 | 50.407609 |
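Beyond eyeballing the lines, the relationship between the two columns of such a merged frame can be quantified with a Pearson correlation. A minimal sketch with made-up numbers (not France's actual values):

```python
import pandas as pd
from scipy import stats

# Hypothetical merged frame: tourism rising while search interest falls.
demo = pd.DataFrame({
    'inbound tourism': [87, 88, 90, 93, 95, 100],
    'interest over time': [100, 92, 75, 64, 57, 50],
}, index=range(2014, 2020))

coef, p_value = stats.pearsonr(demo['inbound tourism'],
                               demo['interest over time'])
print(coef, p_value)  # strongly negative for this made-up series
```

The same call could be applied to `france_interest`, `albania_interest`, and `china_interest` to put a number on each country's trend.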
inbound_interest[inbound_interest.name == 'China'][np.arange(2004,2020)].T.rename(columns = {2: 'inbound tourism'}).plot()
year[year['index'] == 'China'].drop('index', axis= 1).T.rename(columns = {39: 'interest over time'}).plot()
france_interest_over_time = px.line(france_interest, title = 'Percentage of Inbound Tourism and Interest Over Time of France', labels = {'value': 'Percentage', 'index': 'Year'}, markers = True)
france_interest_over_time
albania_interest = pd.merge(inbound_interest[inbound_interest.name == 'Albania'][np.arange(2004,2020)].T.rename(columns = {79: 'inbound tourism'}), year[year['index'] == 'Albania'].drop('index', axis= 1).T.rename(columns = {2: 'interest over time'}), left_index=True, right_index=True)
albania_interest
| | inbound tourism | interest over time |
|---|---|---|
| 2004 | 10.068686 | 61.878453 |
| 2005 | 11.676553 | 61.049724 |
| 2006 | 14.626912 | 54.972376 |
| 2007 | 17.592882 | 59.944751 |
| 2008 | 22.166719 | 70.441989 |
| 2009 | 28.972838 | 77.348066 |
| 2010 | 37.730253 | 74.033149 |
| 2011 | 45.769591 | 74.033149 |
| 2012 | 54.854824 | 70.994475 |
| 2013 | 50.827349 | 70.441989 |
| 2014 | 57.336872 | 80.662983 |
| 2015 | 64.486419 | 78.729282 |
| 2016 | 73.930690 | 96.685083 |
| 2017 | 79.893850 | 88.397790 |
| 2018 | 92.522635 | 85.911602 |
| 2019 | 100.000000 | 100.000000 |
albania_interest_over_time = px.line(albania_interest, title = 'Percentage of Inbound Tourism and Interest Over Time of Albania', labels = {'value': 'Percentage', 'index': 'Year'}, markers = True)
albania_interest_over_time
china_interest = pd.merge(inbound_interest[inbound_interest.name == 'China'][np.arange(2004,2020)].T.rename(columns = {2: 'inbound tourism'}), year[year['index'] == 'China'].drop('index', axis= 1).T.rename(columns = {39: 'interest over time'}), left_index=True, right_index=True)
china_interest
| | inbound tourism | interest over time |
|---|---|---|
| 2004 | 67.084620 | 100.000000 |
| 2005 | 74.008540 | 99.545455 |
| 2006 | 76.869409 | 87.878788 |
| 2007 | 81.133643 | 82.878788 |
| 2008 | 79.997908 | 83.484848 |
| 2009 | 77.813188 | 71.818182 |
| 2010 | 82.295832 | 75.606061 |
| 2011 | 83.317747 | 73.030303 |
| 2012 | 81.460951 | 71.666667 |
| 2013 | 79.414045 | 69.696970 |
| 2014 | 79.057820 | 66.666667 |
| 2015 | 82.331516 | 69.848485 |
| 2016 | 87.225141 | 69.545455 |
| 2017 | 94.291796 | 71.969697 |
| 2018 | 97.580873 | 71.818182 |
| 2019 | 100.000000 | 72.878788 |
china_interest_over_time = px.line(china_interest, title = 'Percentage of Inbound Tourism and Interest Over Time of China', labels = {'value': 'Percentage', 'index': 'Year'}, markers = True)
china_interest_over_time
Finally, I want to test for any correlation between the area of countries and their inbound tourism. This is because it seems logical that a larger country should have more inbound tourism.
trend_area = inbound[['name', 'region', 'country_alpha_3', 2019]].dropna().sort_values(by = 2019, ascending = False).reset_index(drop = True)
trend_area[2019] *= 1000000
trend_area = pd.merge(trend_area, area[['Country Code', 2019.0]], left_on = 'country_alpha_3', right_on = 'Country Code')
trend_area = trend_area.rename(columns = {'2019_x': 'inbound', '2019.0_y': 'area'})
trend_area = trend_area[(trend_area['area'] < 3000000)] # keep countries smaller than 3,000,000 sq. km, dropping a handful of very large outliers
trend_area
| | name | region | country_alpha_3 | inbound | Country Code | area |
|---|---|---|---|---|---|---|
| 0 | France | Europe | FRA | 217877000.0 | FRA | 547557.000 |
| 3 | Spain | Europe | ESP | 126170000.0 | ESP | 499570.036 |
| 4 | Mexico | Americas | MEX | 97406000.0 | MEX | 1943950.000 |
| 5 | Italy | Europe | ITA | 95399000.0 | ITA | 295717.000 |
| 6 | Poland | Europe | POL | 88515000.0 | POL | 306110.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 189 | Micronesia | Oceania | FSM | 18000.0 | FSM | 700.000 |
| 190 | Kiribati | Oceania | KIR | 12000.0 | KIR | 810.000 |
| 191 | Turkmenistan | Asia | TKM | 8200.0 | TKM | 469930.000 |
| 192 | Marshall Islands | Oceania | MHL | 6100.0 | MHL | 180.000 |
| 193 | Tuvalu | Oceania | TUV | 3700.0 | TUV | 30.000 |
188 rows × 6 columns
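The join above matches the tourism rows to the area rows via their ISO alpha-3 codes using `left_on`/`right_on`. A toy sketch of the same merge (illustrative frames, not the project's data):

```python
import pandas as pd

# Hypothetical slices of the two tables being joined.
tourism = pd.DataFrame({'name': ['France', 'Spain'],
                        'country_alpha_3': ['FRA', 'ESP'],
                        'inbound': [217877000.0, 126170000.0]})
areas = pd.DataFrame({'Country Code': ['ESP', 'FRA'],
                      'area': [499570.036, 547557.0]})

# Rows are matched on the alpha-3 code, regardless of row order.
merged = pd.merge(tourism, areas,
                  left_on='country_alpha_3', right_on='Country Code')
print(merged)
```

Countries missing from either side are silently dropped, since `pd.merge` defaults to an inner join.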
I chose to analyse the trend within each region separately, as this also accounts for the countries' locations.
inbound_area = px.scatter(trend_area, x = 'area', y = 'inbound', color = 'region', opacity=0.8, trendline='ols', title = 'The Correlation between Inbound Tourism and Area', labels = {'inbound': 'Inbound Tourism', 'area': 'Area'}).update_traces(marker=dict(size=5, line=dict(width=1, color='DarkSlateGrey')), selector=dict(mode='markers')).update_layout(legend_title = 'Region')
inbound_area
inbound_area_europe = trend_area[trend_area['region'] == 'Europe'].copy()
inbound_area_europe
| | name | region | country_alpha_3 | inbound | Country Code | area |
|---|---|---|---|---|---|---|
| 0 | France | Europe | FRA | 217877000.0 | FRA | 547557.000 |
| 3 | Spain | Europe | ESP | 126170000.0 | ESP | 499570.036 |
| 5 | Italy | Europe | ITA | 95399000.0 | ITA | 295717.000 |
| 6 | Poland | Europe | POL | 88515000.0 | POL | 306110.000 |
| 7 | Hungary | Europe | HUN | 61397000.0 | HUN | 91260.000 |
| 8 | Croatia | Europe | HRV | 60021000.0 | HRV | 55964.000 |
| 11 | United Kingdom | Europe | GBR | 40857000.0 | GBR | 241930.000 |
| 13 | Germany | Europe | DEU | 39563000.0 | DEU | 349390.000 |
| 14 | Czech Republic | Europe | CZE | 37202000.0 | CZE | 77205.500 |
| 16 | Greece | Europe | GRC | 34005000.0 | GRC | 128900.000 |
| 17 | Denmark | Europe | DNK | 33093000.0 | DNK | 40000.000 |
| 19 | Austria | Europe | AUT | 31884000.0 | AUT | 82520.000 |
| 24 | Netherlands | Europe | NLD | 20129000.0 | NLD | 33670.000 |
| 29 | Portugal | Europe | PRT | 17283000.0 | PRT | 91605.600 |
| 31 | Slovakia | Europe | SVK | 16086000.0 | SVK | 48080.000 |
| 33 | Ukraine | Europe | UKR | 13710000.0 | UKR | 579400.000 |
| 36 | Romania | Europe | ROU | 12815000.0 | ROU | 230080.000 |
| 37 | Bulgaria | Europe | BGR | 12552000.0 | BGR | 108560.000 |
| 38 | Ireland | Europe | IRL | 12401000.0 | IRL | 68890.000 |
| 39 | Belarus | Europe | BLR | 11832000.0 | BLR | 202965.000 |
| 40 | Switzerland | Europe | CHE | 11818000.0 | CHE | 39516.030 |
| 45 | Belgium | Europe | BEL | 9343000.0 | BEL | 30280.000 |
| 51 | Latvia | Europe | LVA | 8342000.0 | LVA | 62227.000 |
| 52 | Andorra | Europe | AND | 8235000.0 | AND | 470.000 |
| 54 | Sweden | Europe | SWE | 7616000.0 | SWE | 407283.590 |
| 59 | Albania | Europe | ALB | 6406000.0 | ALB | 27400.000 |
| 61 | Lithuania | Europe | LTU | 6150000.0 | LTU | 62620.000 |
| 62 | Estonia | Europe | EST | 6103000.0 | EST | 42750.000 |
| 63 | Norway | Europe | NOR | 5879000.0 | NOR | 365094.000 |
| 70 | Slovenia | Europe | SVN | 4702000.0 | SVN | 20136.400 |
| 80 | Malta | Europe | MLT | 3519000.0 | MLT | 320.000 |
| 84 | Finland | Europe | FIN | 3290000.0 | FIN | 303930.000 |
| 88 | Montenegro | Europe | MNE | 2510000.0 | MNE | 13450.000 |
| 95 | Iceland | Europe | ISL | 2202000.0 | ISL | 100830.000 |
| 103 | San Marino | Europe | SMR | 1904000.0 | SMR | 60.000 |
| 105 | Republic of Serbia | Europe | SRB | 1847000.0 | SRB | 87460.000 |
| 122 | Bosnia and Herzegovina | Europe | BIH | 1198000.0 | BIH | 51200.000 |
| 126 | Luxembourg | Europe | LUX | 1041000.0 | LUX | 2574.460 |
| 142 | Macedonia | Europe | MKD | 758000.0 | MKD | 25220.000 |
| 146 | Monaco | Europe | MCO | 545000.0 | MCO | 2.027 |
| 170 | Moldova | Europe | MDA | 174000.0 | MDA | 32885.900 |
| 175 | Liechtenstein | Europe | LIE | 98100.0 | LIE | 160.000 |
sns.residplot(x = inbound_area_europe['area'], y = inbound_area_europe['inbound'])
plt.show()
train_idx, test_idx = train_test_split(inbound_area_europe.index, test_size=.25, random_state=0)
inbound_area_europe['split'] = 'train'
inbound_area_europe.loc[test_idx, 'split'] = 'test'
X = inbound_area_europe[['area']]
X_train = inbound_area_europe.loc[train_idx, ['area']]
y_train = inbound_area_europe.loc[train_idx, 'inbound']
model = LinearRegression()
model.fit(X_train, y_train)
inbound_area_europe['prediction'] = model.predict(X)
inbound_area_europe['residual'] = inbound_area_europe['prediction'] - inbound_area_europe['inbound']
inbound_area_europe_residplot = px.scatter(
inbound_area_europe, x='prediction', y='residual',
marginal_y='violin',
color='split', trendline='ols'
)
inbound_area_europe_residplot
# Note: this Pearson correlation is computed over all regions in trend_area, not only Europe.
europe_area_coef, europe_area_value = stats.pearsonr(trend_area['area'], trend_area['inbound'])
europe_area_coef, europe_area_value
(0.127905459662025, 0.0802483560283787)
outbound_map
outbound_vbar
outbound_hbar
As we can see, the United States of America has the highest outbound tourism, followed by Mexico and Germany.
inbound_map
inbound_vbar
inbound_hbar
As we can see, France has the highest inbound tourism, followed by the United States of America and China.
domestic_map
domestic_vbar
domestic_hbar
As we can see, China has the highest domestic tourism, followed by the United States of America and India.
In conclusion, the United States of America, France, and China have the highest outbound, inbound, and domestic tourism respectively.
arrivals_map
arrivals_bar
arrivals_stacked_bar
In conclusion, China is the most visited country overall.
outbound_tree
Outbound Tourism:
Africa: Namibia
Americas: United States of America
Asia: China
Europe: Germany
Oceania: Australia
inbound_tree
Inbound Tourism:
Africa: Namibia
Americas: United States of America
Asia: China
Europe: France
Oceania: Australia
domestic_tree
Domestic Tourism:
Africa: South Africa
Americas: United States of America
Asia: China
Europe: United Kingdom
Oceania: Australia
arrivals_tree
Arrivals:
Africa: South Africa
Americas: United States of America
Asia: China
Europe: United Kingdom
Oceania: Australia
total_tree
total_group
total_box
Total:
Africa: Namibia
Americas: United States of America
Asia: China
Europe: United Kingdom
Oceania: Australia
total_bar
total_stacked_vbar
total_stacked_hbar
In conclusion, China has the most tourism overall.
outbound_growth
Let us focus on the United States of America.
outbound_growth.update_traces({'line': {'color': 'lightgrey'}}).update_traces(patch = {'line': {'color': 'blue', 'width': 2}}, selector={'legendgroup': 'United States of America'}).add_annotation(ax = 2005, ay = 79215000 ,axref = "x", ayref='y', x = 2006, y = 148511000, text = 'Surge', showarrow=True, xshift = -30, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20)
We can see a sudden surge in outbound tourism for the United States of America in 2006.
inbound_growth
Let us focus on France.
inbound_growth.update_traces({'line': {'color': 'lightgrey'}}).update_traces(patch = {'line': {'color': 'blue', 'width': 2}}, selector={'legendgroup': 'France'}).add_annotation(ax = 1997, ay = 157551000 ,axref = "x", ayref='y', x = 1998, y = 70109000, text = 'Drop', showarrow=True, xshift = 10, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20).add_annotation(ax = 2003, ay = 75048000 ,axref = "x", ayref='y', x = 2004, y = 190282000, text = 'Surge', showarrow=True, xshift = -30, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20)
We can see that, apart from a dip between 1998 and 2003, France's inbound tourism is consistently higher than that of other countries.
domestic_growth
Let us focus on China.
domestic_growth.update_traces({'line': {'color': 'lightgrey'}}).update_traces(patch = {'line': {'color': 'blue', 'width': 2}}, selector={'legendgroup': 'China'}).add_annotation(ax = 1997, ay = 644000000 ,axref = "x", ayref='y', x = 2019, y = 6005852000, text = 'Steady Increase', showarrow=True, xshift = -30, yshift = 15, arrowhead = 4, arrowwidth = 2, font_size = 20)
We can see that China's domestic tourism increases steadily and eventually overtakes that of the United States of America.
Here is a pairplot of the ranking of countries based on different aspects against the ranking of inbound tourism.
inbound_ranking
From the pairplot, we can see that Heritage and Cultural Influence show the strongest correlation with Inbound Tourism.
Firstly, let us take a closer look at heritage.
inbound_heritage_contour
The densest region of the plot, containing the most countries, is around a Heritage ranking of 9.5 and an Inbound Tourism ranking of 24.5.
inbound_heritage.add_annotation(x = 25, y = 120, text = 'Correlation Coefficient = ' + str(heritage_coef) + '<br>' + 'P-Value = ' + str(heritage_value), showarrow = False, font_size = 20)
The correlation coefficient of about 0.67 indicates a moderately strong positive relationship between the rankings for inbound tourism and heritage. Moreover, the p-value is far below 0.001, so the correlation is statistically significant.
Moving on to cultural influence.
inbound_culture_contour
The densest region of the plot is around a Cultural Influence ranking of 9.5 and an Inbound Tourism ranking of 24.5.
inbound_culture.add_annotation(x = 25, y = 110, text = 'Correlation Coefficient = ' + str(culture_coef) + '<br>' + 'P-Value = ' + str(culture_value), showarrow = False, font_size = 20)
The correlation coefficient of about 0.49 indicates a moderate positive relationship between the rankings for inbound tourism and cultural influence. The p-value is still well below 0.001, so this correlation is also statistically significant.
Here is a small multiples plot of inbound tourism and interest over time.
interest
There is no clear correlation.
Let us look at France, for instance.
france_interest_over_time
Inbound tourism increases slightly while interest over time decreases. Inbound tourism peaks at the end of the period, whereas interest over time peaks at the start.
Next, let us look at Albania.
albania_interest_over_time
Inbound tourism increases slightly while interest over time rises more rapidly. Both reach their peak at the end of the period.
Finally, let us look at China.
china_interest_over_time
Inbound tourism increases while interest over time decreases. Inbound tourism peaks at the end of the period, whereas interest over time peaks at the start.
It is clear that there is no significant correlation between inbound tourism and interest over time.
A scatterplot of inbound tourism against area is plotted with the points being grouped by region.
inbound_area.add_annotation(x = 600000, y = 100000000, text = 'Correlation Coefficient (all regions) = ' + str(europe_area_coef) + '<br>' + 'P-Value = ' + str(europe_area_value), showarrow = False, font_size = 10)
Regression lines are plotted for each region as there is no clear trend overall.
Let us focus on Europe, for example.
inbound_area_europe_residplot
We can see that the residual plot fans out: the spread of the residuals grows with the predicted value, so the predictions of inbound tourism become less reliable as area increases.
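One way to quantify the fan shape is to compare the residual spread in the lower and upper halves of the predictions. A sketch on synthetic heteroscedastic data (the names and noise model here are illustrative, not the project's):

```python
import numpy as np

rng = np.random.default_rng(1)
pred = np.linspace(1, 100, 200)
# Noise whose standard deviation grows with the prediction -> fan shape.
resid = rng.normal(0, pred * 0.5)

half = len(pred) // 2
low_spread = resid[:half].std()   # spread of residuals for small predictions
high_spread = resid[half:].std()  # spread of residuals for large predictions
print(low_spread, high_spread)
```

A clearly larger spread in the upper half confirms heteroscedasticity, matching what the residual plot shows visually.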
In conclusion, heritage and cultural influence correlate decently with inbound tourism, interest over time shows no clear relationship, and area shows only a weak, statistically insignificant correlation.
It would be useful to analyse paired origin-destination data, showing where tourists depart from and where they arrive. However, more data would need to be collected in this area.
It would also be useful to record the purpose of each trip, so that the analysis could focus on leisure travel.
For future works, we can analyse how COVID-19 affected tourism and the progress of recovery for various countries.
Furthermore, more correlations can be examined and researched to determine whether causation exists.